History log of /linux-master/fs/ext4/file.c
Revision Date Author Comments
# 2ffd2a6a 05-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations

The "needed" controls the number of ext4_prealloc_space to discard in
ext4_discard_preallocations. Function ext4_discard_preallocations is
supposed to discard all non-used preallocated blocks when "needed"
is 0 and now ext4_discard_preallocations is always called with "needed"
= 0. Remove unnecessary parameter "needed" and remove all non-used
preallocated spaces in ext4_discard_preallocations to simplify the
code.

Note: If count of non-used preallocated spaces could be more than
UINT_MAX, there was a memory leak as some non-used preallocated
spaces are left ununsed and this commit will fix it. Otherwise,
there is no behavior change.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 619f75da 30-Nov-2023 Jan Kara <jack@suse.cz>

ext4: fix warning in ext4_dio_write_end_io()

The syzbot has reported that it can hit the warning in
ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
reproducer creates a race between DIO IO completion and truncate
expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
inode state where i_disksize is already updated but i_size is not
updated yet. Since we are careful when setting up DIO write and consider
it extending (and thus performing the IO synchronously with i_rwsem held
exclusively) whenever it goes past either of i_size or i_disksize, we
can use the same test during IO completion without risking entering
ext4_handle_inode_extension() without i_rwsem held. This way we make it
obvious both i_size and i_disksize are large enough when we report DIO
completion without relying on unreliable WARN_ON.

Reported-by: <syzbot+47479b71cdfc78f56d30@syzkaller.appspotmail.com>
Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20231130095653.22679-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 91562895 13-Oct-2023 Jan Kara <jack@suse.cz>

ext4: properly sync file size update after O_SYNC direct IO

Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
sync file size update and thus if we crash at unfortunate moment, the
file can have smaller size although O_SYNC IO has reported successful
completion. The problem happens because update of on-disk inode size is
handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
dio_complete() in particular) has returned and generic_file_sync() gets
called by dio_complete(). Fix the problem by handling on-disk inode size
update directly in our ->end_io completion handler.

References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com
Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
CC: stable@vger.kernel.org
Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20231013121350.26872-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ce56d213 02-Oct-2023 Brian Foster <bfoster@redhat.com>

ext4: fix racy may inline data check in dio write

syzbot reports that the following warning from ext4_iomap_begin()
triggers as of the commit referenced below:

if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
return -ERANGE;

This occurs during a dio write, which is never expected to encounter
an inode with inline data. To enforce this behavior,
ext4_dio_write_iter() checks the current inline state of the inode
and clears the MAY_INLINE_DATA state flag to either fall back to
buffered writes, or enforce that any other writers in progress on
the inode are not allowed to create inline data.

The problem is that the check for existing inline data and the state
flag can span a lock cycle. For example, if the ilock is originally
locked shared and subsequently upgraded to exclusive, another writer
may have reacquired the lock and created inline data before the dio
write task acquires the lock and proceeds.

The commit referenced below loosens the lock requirements to allow
some forms of unaligned dio writes to occur under shared lock, but
AFAICT the inline data check was technically already racy for any
dio write that would have involved a lock cycle. Regardless, lift
clearing of the state bit to the same lock critical section that
checks for preexisting inline data on the inode to close the race.

Cc: stable@kernel.org
Reported-by: syzbot+307da6ca5cb0d01d581a@syzkaller.appspotmail.com
Fixes: 310ee0902b8d ("ext4: allow concurrent unaligned dio overwrites")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20231002185020.531537-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 194505b5 09-Aug-2023 Brian Foster <bfoster@redhat.com>

ext4: drop dio overwrite only flag and associated warning

The commit referenced below opened up concurrent unaligned dio under
shared locking for pure overwrites. In doing so, it enabled use of
the IOMAP_DIO_OVERWRITE_ONLY flag and added a warning on unexpected
-EAGAIN returns as an extra precaution, since ext4 does not retry
writes in such cases. The flag itself is advisory in this case since
ext4 checks for unaligned I/Os and uses appropriate locking up
front, rather than on a retry in response to -EAGAIN.

As it turns out, the warning check is susceptible to false positives
because there are scenarios where -EAGAIN can be expected from lower
layers without necessarily having IOCB_NOWAIT set on the iocb. For
example, one instance of the warning has been seen where io_uring
sets IOCB_HIPRI, which in turn results in REQ_POLLED|REQ_NOWAIT on
the bio. This results in -EAGAIN if the block layer is unable to
allocate a request, etc. [Note that there is an outstanding patch to
untangle REQ_POLLED and REQ_NOWAIT such that the latter relies on
IOCB_NOWAIT, which would also address this instance of the warning.]

Another instance of the warning has been reproduced by syzbot. A dio
write is interrupted down in __get_user_pages_locked() waiting on
the mm lock and returns -EAGAIN up the stack. If the iomap dio
iteration layer has made no progress on the write to this point,
-EAGAIN returns up to the filesystem and triggers the warning.

This use of the overwrite flag in ext4 is precautionary and
half-baked. I.e., ext4 doesn't actually implement overwrite checking
in the iomap callbacks when the flag is set, so the only extra
verification it provides are i_size checks in the generic iomap dio
layer. Combined with the tendency for false positives, the added
verification is not worth the extra trouble. Remove the flag,
associated warning, and update the comments to document when
concurrent unaligned dio writes are allowed and why said flag is not
used.

Cc: stable@kernel.org
Reported-by: syzbot+5050ad0fb47527b1808a@syzkaller.appspotmail.com
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Fixes: 310ee0902b8d ("ext4: allow concurrent unaligned dio overwrites")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230810165559.946222-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# eb8ab444 16-Jun-2023 Jan Kara <jack@suse.cz>

ext4: make ext4_forced_shutdown() take struct super_block

Currently ext4_forced_shutdown() takes struct ext4_sb_info but most
callers need to get it from struct super_block anyway. So just pass in
struct super_block to save all callers from some boilerplate code. No
functional changes.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616165109.21695-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1d024e7a 18-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: remove enum page_entry_size

Remove the unnecessary encoding of page order into an enum and pass the
page order directly. That lets us get rid of pe_order().

The switch constructs have to be changed to if/else constructs to prevent
GCC from warning on builds with 3-level page tables where PMD_ORDER and
PUD_ORDER have the same value.

If you are looking at this commit because your driver stopped compiling,
look at the previous commit as well and audit your driver to be sure it
doesn't depend on mmap_lock being held in its ->huge_fault method.

[willy@infradead.org: use "order %u" to match the (non dev_t) style]
Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org
Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 310ee090 14-Mar-2023 Brian Foster <bfoster@redhat.com>

ext4: allow concurrent unaligned dio overwrites

We've had reports of significant performance regression of sub-block
(unaligned) direct writes due to the added exclusivity restrictions
in ext4. The purpose of the exclusivity requirement for unaligned
direct writes is to avoid data corruption caused by unserialized
partial block zeroing in the iomap dio layer across overlapping
writes.

XFS has similar requirements for the same underlying reasons, yet
doesn't suffer the extreme performance regression that ext4 does.
The reason for this is that XFS utilizes IOMAP_DIO_OVERWRITE_ONLY
mode, which allows for optimistic submission of concurrent unaligned
I/O and kicks back writes that require partial block zeroing such
that they can be submitted in a safe, exclusive context. Since ext4
already performs most of these checks pre-submission, it can support
something similar without necessarily relying on the iomap flag and
associated retry mechanism.

Update the dio write submission path to allow concurrent submission
of unaligned direct writes that are purely overwrite and so will not
require block zeroing. To improve readability of the various related
checks, move the unaligned I/O handling down into
ext4_dio_write_checks(), where the dio draining and force wait logic
can immediately follow the locking requirement checks. Finally, the
IOMAP_DIO_OVERWRITE_ONLY flag is set to enable a warning check as a
precaution should the ext4 overwrite logic ever become inconsistent
with the zeroing expectations of iomap dio.

The performance improvement of sub-block direct write I/O is shown
in the following fio test on a 64xcpu guest vm:

Test: fio --name=test --ioengine=libaio --direct=1 --group_reporting
--overwrite=1 --thread --size=10G --filename=/mnt/fio
--readwrite=write --ramp_time=10s --runtime=60s --numjobs=8
--blocksize=2k --iodepth=256 --allow_file_create=0

v6.2: write: IOPS=4328, BW=8724KiB/s
v6.2 (patched): write: IOPS=801k, BW=1565MiB/s

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230314130759.642710-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 182c25e9 01-Jun-2023 Christoph Hellwig <hch@lst.de>

filemap: update ki_pos in generic_perform_write

All callers of generic_perform_write need to updated ki_pos, move it into
common code.

Link: https://lkml.kernel.org/r/20230601145904.1385409-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0d625446 01-Jun-2023 Christoph Hellwig <hch@lst.de>

backing_dev: remove current->backing_dev_info

Patch series "cleanup the filemap / direct I/O interaction", v4.

This series cleans up some of the generic write helper calling conventions
and the page cache writeback / invalidation for direct I/O. This is a
spinoff from the no-bufferhead kernel project, for which we'll want to an
use iomap based buffered write path in the block layer.


This patch (of 12):

The last user of current->backing_dev_info disappeared in commit
b9b1335e6403 ("remove bdi_congested() and wb_congested() and related
functions"). Remove the field and all assignments to it.

Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2cb1e089 22-May-2023 David Howells <dhowells@redhat.com>

splice: Use filemap_splice_read() instead of generic_file_splice_read()

Replace pointers to generic_file_splice_read() with calls to
filemap_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fa6c46e7 22-May-2023 David Howells <dhowells@redhat.com>

ext4: Provide a splice-read wrapper

Provide a splice_read wrapper for Ext4. This does the inode shutdown check
before proceeding. Splicing from DAX files and O_DIRECT fds is handled by
the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Andreas Dilger <adilger.kernel@dilger.ca>
cc: linux-ext4@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-19-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d8aeb44a 07-Mar-2023 Jens Axboe <axboe@kernel.dk>

fs: add FMODE_DIO_PARALLEL_WRITE flag

Some filesystems support multiple threads writing to the same file with
O_DIRECT without requiring exclusive access to it. io_uring can use this
hint to avoid serializing dio writes to this inode, instead allowing them
to run in parallel.

XFS and ext4 both fall into this category, so set the flag for both of
them.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 240930fb 25-Dec-2022 Zhang Yi <yi.zhang@huawei.com>

ext4: dio take shared inode lock when overwriting preallocated blocks

In the dio write path, we only take shared inode lock for the case of
aligned overwriting initialized blocks inside EOF. But for overwriting
preallocated blocks, it may only need to split unwritten extents, this
procedure has been protected under i_data_sem lock, it's safe to
release the exclusive inode lock and take shared inode lock.

This could give a significant speed up for multi-threaded writes. Test
on Intel Xeon Gold 6140 and nvme SSD with below fio parameters.

direct=1
ioengine=libaio
iodepth=10
numjobs=10
runtime=60
rw=randwrite
size=100G

And the test result are:
Before:
bs=4k IOPS=11.1k, BW=43.2MiB/s
bs=16k IOPS=11.1k, BW=173MiB/s
bs=64k IOPS=11.2k, BW=697MiB/s

After:
bs=4k IOPS=41.4k, BW=162MiB/s
bs=16k IOPS=41.3k, BW=646MiB/s
bs=64k IOPS=13.5k, BW=843MiB/s

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221226062015.3479416-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1c71222e 26-Jan-2023 Suren Baghdasaryan <surenb@google.com>

mm: replace vma->vm_flags direct modifications with modifier calls

Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# cac2f8b8 22-Sep-2022 Christian Brauner <brauner@kernel.org>

fs: rename current get acl method

The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

The current inode operation for getting posix acls takes an inode
argument but various filesystems (e.g., 9p, cifs, overlayfs) need access
to the dentry. In contrast to the ->set_acl() inode operation we cannot
simply extend ->get_acl() to take a dentry argument. The ->get_acl()
inode operation is called from:

acl_permission_check()
-> check_acl()
-> get_acl()

which is part of generic_permission() which in turn is part of
inode_permission(). Both generic_permission() and inode_permission() are
called in the ->permission() handler of various filesystems (e.g.,
overlayfs). So simply passing a dentry argument to ->get_acl() would
amount to also having to pass a dentry argument to ->permission(). We
should avoid this unnecessary change.

So instead of extending the existing inode operation rename it from
->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that
passes a dentry argument and which filesystems that need access to the
dentry can implement instead of ->get_inode_acl(). Filesystems like cifs
which allow setting and getting posix acls but not using them for
permission checking during lookup can simply not implement
->get_inode_acl().

This is intended to be a non-functional change.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 4bb26f28 27-Jul-2022 Jan Kara <jack@suse.cz>

ext4: avoid crash when inline data creation follows DIO write

When inode is created and written to using direct IO, there is nothing
to clear the EXT4_STATE_MAY_INLINE_DATA flag. Thus when inode gets
truncated later to say 1 byte and written using normal write, we will
try to store the data as inline data. This confuses the code later
because the inode now has both normal block and inline data allocated
and the confusion manifests for example as:

kernel BUG at fs/ext4/inode.c:2721!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 359 Comm: repro Not tainted 5.19.0-rc8-00001-g31ba1e3b8305-dirty #15
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
RIP: 0010:ext4_writepages+0x363d/0x3660
RSP: 0018:ffffc90000ccf260 EFLAGS: 00010293
RAX: ffffffff81e1abcd RBX: 0000008000000000 RCX: ffff88810842a180
RDX: 0000000000000000 RSI: 0000008000000000 RDI: 0000000000000000
RBP: ffffc90000ccf650 R08: ffffffff81e17d58 R09: ffffed10222c680b
R10: dfffe910222c680c R11: 1ffff110222c680a R12: ffff888111634128
R13: ffffc90000ccf880 R14: 0000008410000000 R15: 0000000000000001
FS: 00007f72635d2640(0000) GS:ffff88811b000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000565243379180 CR3: 000000010aa74000 CR4: 0000000000150eb0
Call Trace:
<TASK>
do_writepages+0x397/0x640
filemap_fdatawrite_wbc+0x151/0x1b0
file_write_and_wait_range+0x1c9/0x2b0
ext4_sync_file+0x19e/0xa00
vfs_fsync_range+0x17b/0x190
ext4_buffered_write_iter+0x488/0x530
ext4_file_write_iter+0x449/0x1b90
vfs_write+0xbcd/0xf40
ksys_write+0x198/0x2c0
__x64_sys_write+0x7b/0x90
do_syscall_64+0x3d/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
</TASK>

Fix the problem by clearing EXT4_STATE_MAY_INLINE_DATA when we are doing
direct IO write to a file.

Cc: stable@kernel.org
Reported-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Reported-by: syzbot+bd13648a53ed6933ca49@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=a1e89d09bbbcbd5c4cb45db230ee28c822953984
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Tested-by: Tadeusz Struk<tadeusz.struk@linaro.org>
Link: https://lore.kernel.org/r/20220727155753.13969-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8434ef1d 27-Aug-2022 Eric Biggers <ebiggers@google.com>

ext4: support STATX_DIOALIGN

Add support for STATX_DIOALIGN to ext4, so that direct I/O alignment
restrictions are exposed to userspace in a generic way.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220827065851.135710-5-ebiggers@kernel.org


# 53dd3f80 27-Aug-2022 Eric Biggers <ebiggers@google.com>

fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN

To prepare for STATX_DIOALIGN support, make two changes to
fscrypt_dio_supported().

First, remove the filesystem-block-alignment check and make the
filesystems handle it instead. It previously made sense to have it in
fs/crypto/; however, to support STATX_DIOALIGN the alignment restriction
would have to be returned to filesystems. It ends up being simpler if
filesystems handle this part themselves, especially for f2fs which only
allows fs-block-aligned DIO in the first place.

Second, make fscrypt_dio_supported() work on inodes whose encryption key
hasn't been set up yet, by making it set up the key if needed. This is
required for statx(), since statx() doesn't require a file descriptor.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220827065851.135710-4-ebiggers@kernel.org


# 786f847f 05-May-2022 Christoph Hellwig <hch@lst.de>

iomap: add per-iomap_iter private data

Allow the file system to keep state for all iterations. For now only
wire it up for direct I/O as there is an immediate need for it there.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 800ba295 19-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Pass an iocb to generic_perform_write()

We can extract both the file pointer and the pos from the iocb.
This simplifies each caller as well as allowing generic_perform_write()
to see more of the iocb contents in the future.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>


# 38ea50da 28-Jan-2022 Eric Biggers <ebiggers@google.com>

ext4: support direct I/O with fscrypt using blk-crypto

Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data. However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.

Therefore, make ext4 support DIO on files that are using inline
encryption. Since ext4 uses iomap for DIO, and fscrypt support was
already added to iomap DIO, this just requires two small changes:

- Let DIO proceed when supported, by checking fscrypt_dio_supported()
instead of assuming that encrypted files never support DIO.

- In ext4_iomap_begin(), use fscrypt_limit_io_blocks() to limit the
length of the mapping in the rare case where a DUN discontiguity
occurs in the middle of an extent. The iomap DIO implementation
requires this, since it assumes that it can submit a bio covering (up
to) the whole mapping, without checking fscrypt constraints itself.

Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220128233940.79464-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 2729cfdc 23-Dec-2021 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: use ext4_journal_start/stop for fast commit transactions

This patch drops all calls to ext4_fc_start_update() and
ext4_fc_stop_update(). To ensure that there are no ongoing journal
updates during fast commit, we also make jbd2_fc_begin_commit() lock
journal for updates. This way we don't have to maintain two different
transaction start stop APIs for fast commit and full commit. This
patch doesn't remove the functions altogether since in future we want
to have inode level locking for fast commits.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-2-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4fdccaa0 23-Jul-2021 Andreas Gruenbacher <agruenba@redhat.com>

iomap: Add done_before argument to iomap_dio_rw

Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred. When the request succeeds, we
report that done_before additional bytes were tranferred. This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted. The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 3e08773c 12-Oct-2021 Christoph Hellwig <hch@lst.de>

block: switch polling to be bio based

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 188c299e 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Support for checksumming from journal triggers

JBD2 layer support triggers which are called when journaling layer moves
buffer to a certain state. We can use the frozen trigger, which gets
called when buffer data is frozen and about to be written out to the
journal, to compute block checksums for some buffer types (similarly as
does ocfs2). This avoids unnecessary repeated recomputation of the
checksum (at the cost of larger window where memory corruption won't be
caught by checksumming) and is even necessary when there are
unsynchronized updaters of the checksummed data.

So add superblock and journal trigger type arguments to
ext4_journal_get_write_access() and ext4_journal_get_create_access() so
that frozen triggers can be set accordingly. Also add inode argument to
ext4_walk_page_buffers() and all the callbacks used with that function
for the same purpose. This patch is mostly only a change of prototype of
the above mentioned functions and a few small helpers. Real checksumming
will come later.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d4f5258e 04-Feb-2021 Jan Kara <jack@suse.cz>

ext4: Convert to use mapping->invalidate_lock

Convert ext4 to use mapping->invalidate_lock instead of its private
EXT4_I(inode)->i_mmap_sem. This is mostly search-and-replace. By this
conversion we fix a long standing race between hole punching and read(2)
/ readahead(2) paths that can lead to stale page cache contents.

CC: <linux-ext4@vger.kernel.org>
CC: Ted Tso <tytso@mit.edu>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>


# 5899593f 15-Apr-2021 Jan Kara <jack@suse.cz>

ext4: Fix occasional generic/418 failure

Eric has noticed that after pagecache read rework, generic/418 is
occasionally failing for ext4 when blocksize < pagesize. In fact, the
pagecache rework just made hard to hit race in ext4 more likely. The
problem is that since ext4 conversion of direct IO writes to iomap
framework (commit 378f32bab371), we update inode size after direct IO
write only after invalidating page cache. Thus if buffered read sneaks
at unfortunate moment like:

CPU1 - write at offset 1k CPU2 - read from offset 0
iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
ext4_readpage();
ext4_handle_inode_extension()

the read will zero out tail of the page as it still sees smaller inode
size and thus page cache becomes inconsistent with on-disk contents with
all the consequences.

Fix the problem by moving inode size update into end_io handler which
gets called before the page cache is invalidated.

Reported-and-tested-by: Eric Whitney <enwlinux@gmail.com>
Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20210415155417.4734-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4db5c2e6 07-Apr-2021 Miklos Szeredi <mszeredi@redhat.com>

ext4: convert to fileattr

Use the fileattr API to let the VFS handle locking, permission checking and
conversion.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>


# 2f632965 23-Jan-2021 Christoph Hellwig <hch@lst.de>

iomap: pass a flags argument to iomap_dio_rw

Pass a set of flags to iomap_dio_rw instead of the boolean
wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag
replaces the wait_for_completion, but only needs to be passed
when the iocb isn't synchronous to start with to simplify the
callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[djwong: rework xfs_file.c so that we can push iomap changes separately]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 5a3b590d 17-Dec-2020 Theodore Ts'o <tytso@mit.edu>

ext4: don't leak old mountpoint samples

When the first file is opened, ext4 samples the mountpoint of the
filesystem in 64 bytes of the super block. It does so using
strlcpy(), this means that the remaining bytes in the super block
string buffer are untouched. If the mount point before had a longer
path than the current one, it can be reconstructed.

Consider the case where the fs was mounted to "/media/johnjdeveloper"
and later to "/". The super block buffer then contains
"/\x00edia/johnjdeveloper".

This case was seen in the wild and caused confusion how the name
of a developer ands up on the super block of a filesystem used
in production...

Fix this by using strncpy() instead of strlcpy(). The superblock
field is defined to be a fixed-size char array, and it is already
marked using __nonstring in fs/ext4/ext4.h. The consumer of the field
in e2fsprogs already assumes that in the case of a 64+ byte mount
path, that s_last_mounted will not be NUL terminated.

Link: https://lore.kernel.org/r/X9ujIOJG/HqMr88R@mit.edu
Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# a3f5cf14 16-Dec-2020 Jan Kara <jack@suse.cz>

ext4: drop ext4_handle_dirty_super()

The wrapper is now useless since it does what
ext4_handle_dirty_metadata() does. Just remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 05c2c00f 16-Dec-2020 Jan Kara <jack@suse.cz>

ext4: protect superblock modifications with a buffer lock

Protect all superblock modifications (including checksum computation)
with a superblock buffer lock. That way we are sure computed checksum
matches current superblock contents (a mismatch could cause checksum
failures in nojournal mode or if an unjournalled superblock update races
with a journalled one). Also we avoid modifying superblock contents
while it is being written out (which can cause DIF/DIX failures if we
are running in nojournal mode).

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9b5f6c9b 05-Nov-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: make s_mount_flags modifications atomic

Fast commit file system states are recorded in
sbi->s_mount_flags. Fast commit expects these bit manipulations to be
atomic. This patch adds helpers to make those modifications atomic.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-21-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a3114fe7 05-Nov-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: remove unnecessary fast commit calls from ext4_file_mmap

Remove unnecessary calls to ext4_fc_start_update() and
ext4_fc_stop_update() from ext4_file_mmap().

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-17-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# aa75f4d3 15-Oct-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: main fast-commit commit path

This patch adds main fast commit commit path handlers. The overall
patch can be divided into two inter-related parts:

(A) Metadata updates tracking

This part consists of helper functions to track changes that need
to be committed during a commit operation. These updates are
maintained by Ext4 in different in-memory queues. Following are
the APIs and their short description that are implemented in this
patch:

- ext4_fc_track_link/unlink/creat() - Track unlink. link and creat
operations
- ext4_fc_track_range() - Track changed logical block offsets
inodes
- ext4_fc_track_inode() - Track inodes
- ext4_fc_mark_ineligible() - Mark file system fast commit
ineligible()
- ext4_fc_start_update() / ext4_fc_stop_update() /
ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These
functions are useful for co-ordinating inode updates with
commits.

(B) Main commit Path

This part consists of functions to convert updates tracked in
in-memory data structures into on-disk commits. Function
ext4_fc_commit() is the main entry point to commit path.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 766ef1e1 03-Aug-2020 Jens Axboe <axboe@kernel.dk>

ext4: flag as supporting buffered async reads

ext4 uses generic_file_read_iter(), which already supports this.

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Link: https://lore.kernel.org/r/fb90cc2d-b12c-738f-21a4-dd7a8ae0556a@kernel.dk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 27bc446e 17-Aug-2020 brookxu <brookxu.cn@gmail.com>

ext4: limit the length of per-inode prealloc list

In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.

After patching, we observed that the sys ratio of cpu has dropped, and
the system throughput has increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:

Running time on unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m2.051s
user 0m0.008s
sys 0m2.026s

Running time on fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m0.471s
user 0m0.004s
sys 0m0.395s

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 0b3171b6 08-Jul-2020 Jan Kara <jack@suse.cz>

ext4: do not block RWF_NOWAIT dio write on unallocated space

Since commit 378f32bab371 ("ext4: introduce direct I/O write using iomap
infrastructure") we don't properly bail out of RWF_NOWAIT direct IO
write if underlying blocks are not allocated. Also
ext4_dio_write_checks() does not honor RWF_NOWAIT when re-acquiring
i_rwsem. Fix both issues.

Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Cc: stable@kernel.org
Reported-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200708153516.9507-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e030a288 13-Jun-2020 Dio Putra <dioput12@gmail.com>

ext4: fix coding style in file.c

Fixed a few coding style issues in file.c

Signed-off-by: Dio Putra <dioput12@gmail.com>
Link: https://lore.kernel.org/r/239fcd8f-d33f-8621-9e82-0416dd3f9c94@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 60263d58 23-Jul-2020 Christoph Hellwig <hch@lst.de>

iomap: fall back to buffered writes for invalidation failures

Failing to invalid the page cache means data in incoherent, which is
a very bad state for the system. Always fall back to buffered I/O
through the page cache if we can't invalidate mappings.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> # for gfs2
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>


# 6e014c62 24-May-2020 Jens Axboe <axboe@kernel.dk>

ext4: don't block for O_DIRECT if IOCB_NOWAIT is set

Running with some debug patches to detect illegal blocking triggered the
extend/unaligned condition in ext4. If ext4 needs to extend the file (and
hence go to buffered IO), or if the app is doing unaligned IO, then ext4
asks the iomap code to wait for IO completion. If the caller asked for
no-wait semantics by setting IOCB_NOWAIT, then ext4 should return -EAGAIN
instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Link: https://lore.kernel.org/r/76152096-2bbb-7682-8fce-4cb498bcd909@kernel.dk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4209ae12 26-Apr-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: handle ext4_mark_inode_dirty errors

ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
value may lead ext4 to ignore real failures that would result in
corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
as soon as possible and return errors to the caller whenever
appropriate.

One of the possible scnearios when this bug could affected is that
while creating a new inode, its directory entry gets added
successfully but while writing the inode itself mark_inode_dirty
returns error which is ignored. This would result in inconsistency
that the directory entry points to a non-existent inode.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 72f9da1d 07-Feb-2020 Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

ext4: start to support iopoll method

Since commit "b1b4705d54ab ext4: introduce direct I/O read using
iomap infrastructure", we can easily make ext4 support iopoll
method, just use iomap_dio_iopoll().

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20200207120758.2411-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8cd115bd 18-Dec-2019 Jan Kara <jack@suse.cz>

ext4: Optimize ext4 DIO overwrites

Currently we start transaction for mapping every extent for writing
using direct IO. This is unnecessary when we know we are overwriting
already allocated blocks and the overhead of starting a transaction can
be significant especially for multithreaded workloads doing small writes.
Use iomap operations that avoid starting a transaction for direct IO
overwrites.

This improves throughput of 4k random writes - fio jobfile:
[global]
rw=randrw
norandommap=1
invalidate=0
bs=4k
numjobs=16
time_based=1
ramp_time=30
runtime=120
group_reporting=1
ioengine=psync
direct=1
size=16G
filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31
file_service_type=random
nrfiles=32

from 3018MB/s to 4059MB/s in my test VM running test against simulated
pmem device (note that before iomap conversion, this workload was able
to achieve 3708MB/s because old direct IO path avoided transaction start
for overwrites as well). For dax, the win is even larger improving
throughput from 3042MB/s to 4311MB/s.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191218174433.19380-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bc6385da 11-Dec-2019 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: Move to shared i_rwsem even without dioread_nolock mount opt

We were using shared locking only in case of dioread_nolock mount option in case
of DIO overwrites. This mount condition is not needed anymore with current code,
since:-

1. No race between buffered writes & DIO overwrites. Since buffIO writes takes
exclusive lock & DIO overwrites will take shared locking. Also DIO path will
make sure to flush and wait for any dirty page cache data.

2. No race between buffered reads & DIO overwrites, since there is no block
allocation that is possible with DIO overwrites. So no stale data exposure
should happen. Same is the case between DIO reads & DIO overwrites.

3. Also other paths like truncate is protected, since we wait there for any DIO
in flight to be over.

Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-4-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# aa9714d0 11-Dec-2019 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: Start with shared i_rwsem in case of DIO instead of exclusive

Earlier there was no shared lock in DIO read path. But this patch
(16c54688592ce: ext4: Allow parallel DIO reads)
simplified some of the locking mechanism while still allowing for parallel DIO
reads by adding shared lock in inode DIO read path.

But this created problem with mixed read/write workload. It is due to the fact
that in DIO path, we first start with exclusive lock and only when we determine
that it is a ovewrite IO, we downgrade the lock. This causes the problem, since
we still have shared locking in DIO reads.

So, this patch tries to fix this issue by starting with shared lock and then
switching to exclusive lock only when required based on ext4_dio_write_checks().

Other than that, it also simplifies below cases:-

1. Simplified ext4_unaligned_aio API to ext4_unaligned_io. Previous API was
abused in the sense that it was not really checking for AIO anywhere also it
used to check for extending writes. So this API was renamed and simplified to
ext4_unaligned_io() which actully only checks if the IO is really unaligned.

Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of partial
block and that will require serialization against other direct IOs in the same
block. So we take a exclusive inode lock for any unaligned DIO. In case of AIO
we also need to wait for any outstanding IOs to complete so that conversion from
unwritten to written is completed before anyone try to map the overlapping block.
Hence we take exclusive inode lock and also wait for inode_dio_wait() for
unaligned DIO case. Please note since we are anyway taking an exclusive lock in
unaligned IO, inode_dio_wait() becomes a no-op in case of non-AIO DIO.

2. Added ext4_extending_io(). This checks if the IO is extending the file.

3. Added ext4_dio_write_checks(). In this we start with shared inode lock and
only switch to exclusive lock if required. So in most cases with aligned,
non-extending, dioread_nolock & overwrites, it tries to write with a shared
lock. If not, then we restart the operation in ext4_dio_write_checks(), after
acquiring exclusive lock.

Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f629afe3 11-Dec-2019 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT

Apparently our current rwsem code doesn't like doing the trylock, then
lock for real scheme. So change our dax read/write methods to just do the
trylock for the RWF_NOWAIT case.
This seems to fix AIM7 regression in some scalable filesystems upto ~25%
in some cases. Claimed in commit 942491c9e6d6 ("xfs: fix AIM7 regression")

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 378f32ba 05-Nov-2019 Matthew Bobrowski <mbobrowski@mbobrowski.org>

ext4: introduce direct I/O write using iomap infrastructure

This patch introduces a new direct I/O write path which makes use of
the iomap infrastructure.

All direct I/O writes are now passed from the ->write_iter() callback
through to the new direct I/O handler ext4_dio_write_iter(). This
function is responsible for calling into the iomap infrastructure via
iomap_dio_rw().

Code snippets from the existing direct I/O write code within
ext4_file_write_iter() such as, checking whether the I/O request is
unaligned asynchronous I/O, or whether the write will result in an
overwrite have effectively been moved out and into the new direct I/O
->write_iter() handler.
The block mapping flags that are eventually passed down to
ext4_map_blocks() from the *_get_block_*() suite of routines have been
taken out and introduced within ext4_iomap_alloc().

For inode extension cases, ext4_handle_inode_extension() is
effectively the function responsible for performing such metadata
updates. This is called after iomap_dio_rw() has returned so that we
can safely determine whether we need to potentially truncate any
allocated blocks that may have been prepared for this direct I/O
write. We don't perform the inode extension, or truncate operations
from the ->end_io() handler as we don't have the original I/O 'length'
available there. The ->end_io() however is responsible fo converting
allocated unwritten extents to written extents.

In the instance of a short write, we fallback and complete the
remainder of the I/O using buffered I/O via
ext4_buffered_write_iter().

The existing buffer_head direct I/O implementation has been removed as
it's now redundant.

[ Fix up ext4_dio_write_iter() per Jan's comments at
https://lore.kernel.org/r/20191105135932.GN22379@quack2.suse.cz -- TYT ]

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/e55db6f12ae6ff017f36774135e79f3e7b0333da.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 0b9f230b 05-Nov-2019 Matthew Bobrowski <mbobrowski@mbobrowski.org>

ext4: move inode extension check out from ext4_iomap_alloc()

Lift the inode extension/orphan list handling code out from
ext4_iomap_alloc() and apply it within the ext4_dax_write_iter().

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/fd5c84db25d5d0da87d97ed4c36fd844f57da759.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 569342dc 05-Nov-2019 Matthew Bobrowski <mbobrowski@mbobrowski.org>

ext4: move inode extension/truncate code out from ->iomap_end() callback

In preparation for implementing the iomap direct I/O modifications,
the inode extension/truncate code needs to be moved out from the
ext4_iomap_end() callback. For direct I/O, if the current code
remained, it would behave incorrrectly. Updating the inode size prior
to converting unwritten extents would potentially allow a racing
direct I/O read to find unwritten extents before being converted
correctly.

The inode extension/truncate code now resides within a new helper
ext4_handle_inode_extension(). This function has been designed so that
it can accommodate for both DAX and direct I/O extension/truncate
operations.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d41ffa26e20b15b12895812c3cad7c91a6a59bc6.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b1b4705d 05-Nov-2019 Matthew Bobrowski <mbobrowski@mbobrowski.org>

ext4: introduce direct I/O read using iomap infrastructure

This patch introduces a new direct I/O read path which makes use of
the iomap infrastructure.

The new function ext4_do_read_iter() is responsible for calling into
the iomap infrastructure via iomap_dio_rw(). If the read operation
performed on the inode is not supported, which is checked via
ext4_dio_supported(), then we simply fallback and complete the I/O
using buffered I/O.

Existing direct I/O read code path has been removed, as it is now
redundant.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f98a6f73fadddbfbad0fc5ed04f712ca0b799f37.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 09edf4d3 05-Nov-2019 Matthew Bobrowski <mbobrowski@mbobrowski.org>

ext4: introduce new callback for IOMAP_REPORT

As part of the ext4_iomap_begin() cleanups that precede this patch, we
also split up the IOMAP_REPORT branch into a completely separate
->iomap_begin() callback named ext4_iomap_begin_report(). Again, the
raionale for this change is to reduce the overall clutter within
ext4_iomap_begin().

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5c97a569e26ddb6696e3d3ac9fbde41317e029a0.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c93d8f88 22-Jul-2019 Eric Biggers <ebiggers@google.com>

ext4: add basic fs-verity support

Add most of fs-verity support to ext4. fs-verity is a filesystem
feature that enables transparent integrity protection and authentication
of read-only files. It uses a dm-verity like mechanism at the file
level: a Merkle tree is used to verify any block in the file in
log(filesize) time. It is implemented mainly by helper functions in
fs/verity/. See Documentation/filesystems/fsverity.rst for the full
documentation.

This commit adds all of ext4 fs-verity support except for the actual
data verification, including:

- Adding a filesystem feature flag and an inode flag for fs-verity.

- Implementing the fsverity_operations to support enabling verity on an
inode and reading/writing the verity metadata.

- Updating ->write_begin(), ->write_end(), and ->writepages() to support
writing verity metadata pages.

- Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl().

ext4 stores the verity metadata (Merkle tree and fsverity_descriptor)
past the end of the file, starting at the first 64K boundary beyond
i_size. This approach works because (a) verity files are readonly, and
(b) pages fully beyond i_size aren't visible to userspace but can be
read/written internally by ext4 with only some relatively small changes
to ext4. This approach avoids having to depend on the EA_INODE feature
and on rearchitecturing ext4's xattr support to support paging
multi-gigabyte xattrs into memory, and to support encrypting xattrs.
Note that the verity metadata *must* be encrypted when the file is,
since it contains hashes of the plaintext data.

This patch incorporates work by Theodore Ts'o and Chandan Rajendra.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 991f5230 11-Aug-2019 Shi Siyuan <shisiyuan@xiaomi.com>

ext4: remove unnecessary error check

Remove unnecessary error check in ext4_file_write_iter(),
because this check will be done in upcoming later function --
ext4_write_checks() -> generic_write_checks()

Change-Id: I7b0ab27f693a50765c15b5eaa3f4e7c38f42e01e
Signed-off-by: shisiyuan <shisiyuan@xiaomi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e46bfc3f 05-Jul-2019 Pankaj Gupta <pagupta@redhat.com>

ext4: disable map_sync for async flush

Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and ext4.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 02b016ca 09-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: enforce the immutable flag on open files

According to the chattr man page, "a file with the 'i' attribute
cannot be modified..." Historically, this was only enforced when the
file was opened, per the rest of the description, "... and the file
can not be opened in write mode".

There is general agreement that we should standardize all file systems
to prevent modifications even for files that were opened at the time
the immutable flag is set. Eventually, a change to enforce this at
the VFS layer should be landing in mainline. Until then, enforce this
at the ext4 level to prevent xfstests generic/553 from failing.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: stable@kernel.org


# 57a0da28 10-May-2019 Lukas Czerner <lczerner@redhat.com>

ext4: fix data corruption caused by overlapping unaligned and aligned IO

Unaligned AIO must be serialized because the zeroing of partial blocks
of unaligned AIO can result in data corruption in case it's overlapping
another in flight IO.

Currently we wait for all unwritten extents before we submit unaligned
AIO which protects data in case of unaligned AIO is following overlapping
IO. However if a unaligned AIO is followed by overlapping aligned AIO we
can still end up corrupting data.

To fix this, we must make sure that the unaligned AIO is the only IO in
flight by waiting for unwritten extents conversion not just before the
IO submission, but right after it as well.

This problem can be reproduced by xfstest generic/538

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 372a03e0 14-Mar-2019 Lukas Czerner <lczerner@redhat.com>

ext4: fix data corruption caused by unaligned direct AIO

Ext4 needs to serialize unaligned direct AIO because the zeroing of
partial blocks of two competing unaligned AIOs can result in data
corruption.

However it decides not to serialize if the potentially unaligned aio is
past i_size with the rationale that no pending writes are possible past
i_size. Unfortunately if the i_size is not block aligned and the second
unaligned write lands past i_size, but still into the same block, it has
the potential of corrupting the previous unaligned write to the same
block.

This is (very simplified) reproducer from Frank

// 41472 = (10 * 4096) + 512
// 37376 = 41472 - 4096

ftruncate(fd, 41472);
io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376);
io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472);

io_submit(io_ctx, 1, &iocbs[1]);
io_submit(io_ctx, 1, &iocbs[2]);

io_getevents(io_ctx, 2, 2, events, NULL);

Without this patch the 512B range from 40960 up to the start of the
second unaligned write (41472) is going to be zeroed overwriting the data
written by the first write. This is a data corruption.

00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
*
0000a000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31

With this patch the data corruption is avoided because we will recognize
the unaligned_aio and wait for the unwritten extent conversion.

00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
*
0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31
*
0000b200

Reported-by: Frank Sorenson <fsorenso@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: e9e3bcecf44c ("ext4: serialize unaligned asynchronous DIO")
Cc: stable@vger.kernel.org


# e1fb4a08 17-Aug-2018 Dave Jiang <dave.jiang@intel.com>

dax: remove VM_MIXEDMAP for fsdax and device dax

This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/

VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map. The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations. In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.

Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.

DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().

This patch has been tested against ndctl unit test. It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.

Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# db6516a5 13-May-2018 Amir Goldstein <amir73il@gmail.com>

ext4: do not update s_last_mounted of a frozen fs

If fs is frozen after mount and before the first file open, the
update of s_last_mounted bypasses freeze protection and prints out
a WARNING splat:

$ mount /vdf
$ fsfreeze -f /vdf
$ cat /vdf/foo

[ 31.578555] WARNING: CPU: 1 PID: 1415 at
fs/ext4/ext4_jbd2.c:53 ext4_journal_check_start+0x48/0x82

[ 31.614016] Call Trace:
[ 31.614997] __ext4_journal_start_sb+0xe4/0x1a4
[ 31.616771] ? ext4_file_open+0xb6/0x189
[ 31.618094] ext4_file_open+0xb6/0x189

If fs is frozen, skip s_last_mounted update.

[backport hint: to apply to stable tree, need to apply also patches
vfs: add the sb_start_intwrite_trylock() helper
ext4: factor out helper ext4_sample_last_mounted()]

Cc: stable@vger.kernel.org
Fixes: bc0b0d6d69ee ("ext4: update the s_last_mounted field in the superblock")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 833a9508 13-May-2018 Amir Goldstein <amir73il@gmail.com>

ext4: factor out helper ext4_sample_last_mounted()

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 71fe9899 13-May-2018 Souptick Joarder <jrdr.linux@gmail.com>

fs: ext4: add new return type vm_fault_t

Use new return type vm_fault_t for fault handler. For now,
this is just documenting that the function returns a
VM_FAULT value rather than an errno. Once all instances are
converted, vm_fault_t will become a distinct type.

commit 1c8f422059ae ("mm: change return type to vm_fault_t")

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>


# 22446423 07-Jan-2018 Jan Kara <jack@suse.cz>

ext4: fix ENOSPC handling in DAX page fault handler

When allocation of underlying block for a page fault fails, we fail the
fault with SIGBUS. However we may well hit ENOSPC just due to lots of
free blocks being held by the running / committing transaction. So
propagate the error from ext4_iomap_begin() and implement do standard
allocation retry loop in ext4_dax_huge_fault().

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c0b24625 07-Jan-2018 Jan Kara <jack@suse.cz>

dax: pass detailed error code from dax_iomap_fault()

Ext4 needs to pass through error from its iomap handler to the page
fault handler so that it can properly detect ENOSPC and force
transaction commit and retry the fault (and block allocation). Add
argument to dax_iomap_fault() for passing such error.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b8a6176c 01-Nov-2017 Jan Kara <jack@suse.cz>

ext4: Support for synchronous DAX faults

We return IOMAP_F_DIRTY flag from ext4_iomap_begin() when asked to
prepare blocks for writing and the inode has some uncommitted metadata
changes. In the fault handler ext4_dax_fault() we then detect this case
(through VM_FAULT_NEEDDSYNC return value) and call helper
dax_finish_sync_fault() to flush metadata changes and insert page table
entry. Note that this will also dirty corresponding radix tree entry
which is what we want - fsync(2) will still provide data integrity
guarantees for applications not using userspace flushing. And
applications using userspace flushing can avoid calling fsync(2) and
thus avoid the performance overhead.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 497f6926 01-Nov-2017 Jan Kara <jack@suse.cz>

ext4: Simplify error handling in ext4_dax_huge_fault()

If transaction starting fails, just bail out of the function immediately
instead of checking for that condition throughout the function.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 9a0dd422 01-Nov-2017 Jan Kara <jack@suse.cz>

dax: Allow dax_iomap_fault() to return pfn

For synchronous page fault dax_iomap_fault() will need to return PFN
which will then need to be inserted into page tables after fsync()
completes. Add necessary parameter to dax_iomap_fault().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 09a5c31c 18-Oct-2017 Eric Biggers <ebiggers@google.com>

ext4: switch to fscrypt_file_open()

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 545052e9 01-Oct-2017 Christoph Hellwig <hch@lst.de>

ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA

Switch to the iomap_seek_hole and iomap_seek_data helpers for
implementing lseek SEEK_HOLE / SEEK_DATA, and remove all the code that
isn't needed any more.

Note that with this patch ext4 will now always depend on the iomap code
instead of only when CONFIG_DAX is enabled, and it requires adding a
call into the extent status tree for iomap_begin as well to properly
deal with delalloc extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
[More fixes and cleanups by Andreas]


# 397162ff 06-Sep-2017 Jan Kara <jack@suse.cz>

mm: remove nr_pages argument from pagevec_lookup{,_range}()

All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.

Just drop the argument.

Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dec0da7b 06-Sep-2017 Jan Kara <jack@suse.cz>

ext4: use pagevec_lookup_range() in ext4_find_unwritten_pgoff()

Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are
interested only in pages in the given range. Simplify the logic as a
result of not getting pages out of range and index getting automatically
advanced.

Link: http://lkml.kernel.org/r/20170726114704.7626-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d72dc8a2 06-Sep-2017 Jan Kara <jack@suse.cz>

mm: make pagevec_lookup() update index

Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue. Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 91d25ba8 06-Sep-2017 Ross Zwisler <zwisler@kernel.org>

dax: use common 4k zero page for dax mmap reads

When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree.

This has three major drawbacks:

1) It consumes memory unnecessarily. For every 4k page that is read via
a DAX mmap() over a hole, we allocate a new page cache page. This
means that if you read 1GiB worth of pages, you end up using 1GiB of
zeroed memory. This is easily visible by looking at the overall
memory consumption of the system or by looking at /proc/[pid]/smaps:

7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 1048576 kB
Pss: 1048576 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 1048576 kB
Private_Dirty: 0 kB
Referenced: 1048576 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB

2) It is slower than using a common zero page because each page fault
has more work to do. Instead of just inserting a common zero page we
have to allocate a page cache page, zero it, and then insert it. Here
are the average latencies of dax_load_hole() as measured by ftrace on
a random test box:

Old method, using zeroed page cache pages: 3.4 us
New method, using the common 4k zero page: 0.8 us

This was the average latency over 1 GiB of sequential reads done by
this simple fio script:

[global]
size=1G
filename=/root/dax/data
fallocate=none
[io]
rw=read
ioengine=mmap

3) The fact that we had to check for both DAX exceptional entries and
for page cache pages in the radix tree made the DAX code more
complex.

Solve these issues by following the lead of the DAX PMD code and using a
common 4k zero page instead. As with the PMD code we will now insert a
DAX exceptional entry into the radix tree instead of a struct page
pointer which allows us to remove all the special casing in the DAX
code.

Note that we do still pretty aggressively check for regular pages in the
DAX radix tree, especially where we take action based on the bits set in
the page. If we ever find a regular page in our radix tree now that
most likely means that someone besides DAX is inserting pages (which has
happened lots of times in the past), and we want to find that out early
and fail loudly.

This solution also removes the extra memory consumption. Here is that
same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
code:

7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB

Overall system memory consumption is similarly improved.

Another major change is that we remove dax_pfn_mkwrite() from our fault
flow, and instead rely on the page fault itself to make the PTE dirty
and writeable. The following description from the patch adding the
vm_insert_mixed_mkwrite() call explains this a little more:

"To be able to use the common 4k zero page in DAX we need to have our
PTE fault path look more like our PMD fault path where a PTE entry
can be marked as dirty and writeable as it is first inserted rather
than waiting for a follow-up dax_pfn_mkwrite() =>
finish_mkwrite_fault() call.

Right now we can rely on having a dax_pfn_mkwrite() call because we
can distinguish between these two cases in do_wp_page():

case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage

This distinction is made by via vm_normal_page(). vm_normal_page()
returns false for the common 4k zero page, though, just as it does
for DAX ptes. Instead of special casing the DAX + 4k zero page case
we will simplify our DAX PTE page fault sequence so that it matches
our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
We will instead use dax_iomap_fault() to handle write-protection
faults.

This means that insert_pfn() needs to follow the lead of
insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
'mkwrite' is set insert_pfn() will do the work that was previously
done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"

Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 91f9943e 29-Aug-2017 Christoph Hellwig <hch@lst.de>

fs: support RWF_NOWAIT for buffered reads

This is based on the old idea and code from Milosz Tanski. With the aio
nowait code it becomes mostly trivial now. Buffered writes continue to
return -EOPNOTSUPP if RWF_NOWAIT is passed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fd96b8da 24-Aug-2017 Randy Dodgen <dodgen@google.com>

ext4: fix fault handling when mounted with -o dax,ro

If an ext4 filesystem is mounted with both the DAX and read-only
options, executables on that filesystem will fail to start (claiming
'Segmentation fault') due to the fault handler returning
VM_FAULT_SIGBUS.

This is due to the DAX fault handler (see ext4_dax_huge_fault)
attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
the wrong behavior for write faults which will lead to a COW page; in
particular, this fails for readonly mounts.

This change avoids journal writes for faults that are expected to COW.

It might be the case that this could be better handled in
ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
dax_iomap_fault). These is some overlap already (e.g. grabbing journal
handles).

Signed-off-by: Randy Dodgen <dodgen@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>


# 1bd8d6cd 24-Aug-2017 Darrick J. Wong <darrick.wong@oracle.com>

ext4: in ext4_seek_{hole,data}, return -ENXIO for negative offsets

In the ext4 implementations of SEEK_HOLE and SEEK_DATA, make sure we
return -ENXIO for negative offsets instead of banging around inside
the extent code and returning -EFSCORRUPTED.

Reported-by: Mateusz S <muttdini@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org # 4.6


# fcf5ea10 05-Aug-2017 Jan Kara <jack@suse.cz>

ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize

ext4_find_unwritten_pgoff() does not properly handle a situation when
starting index is in the middle of a page and blocksize < pagesize. The
following command shows the bug on filesystem with 1k blocksize:

xfs_io -f -c "falloc 0 4k" \
-c "pwrite 1k 1k" \
-c "pwrite 3k 1k" \
-c "seek -a -r 0" foo

In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
SEEK_DATA) will return the correct result.

Fix the problem by neglecting buffers in a page before starting offset.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org # 3.8+


# bc98a42c 17-Jul-2017 David Howells <dhowells@redhat.com>

VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)

Firstly by applying the following with coccinelle's spatch:

@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)

@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)

to remove left over excess bracketage and finally by applying:

@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>


# 66e0aaad 23-Jun-2017 Eric Biggers <ebiggers@google.com>

ext4: don't bother checking for encryption key in ->mmap()

Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 728fbc0e 20-Jun-2017 Goldwyn Rodrigues <rgoldwyn@suse.com>

ext4: nowait aio support

Return EAGAIN if any of the following checks fail for direct I/O:
+ i_rwsem is lockable
+ Writing beyond end of file (will trigger allocation)
+ Blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 624327f8 24-May-2017 Eryu Guan <eguan@redhat.com>

ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()

ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.

# xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
-c "seek -d 0" /mnt/ext4/testfile
wrote 1024/1024 bytes at offset 2048
1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
Whence Result
DATA EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is unconvered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 3f1d5bad 21-May-2017 Jan Kara <jack@suse.cz>

ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff()

There is an off-by-one error in loop termination conditions in
ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7d95eddf 21-May-2017 Jan Kara <jack@suse.cz>

ext4: fix SEEK_HOLE

Currently, SEEK_HOLE implementation in ext4 may both return that there's
a hole at some offset although that offset already has data and skip
some holes during a search for the next hole. The first problem is
demostrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec)
Whence Result
HOLE 0

Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
a hole although we have written data there. The second problem can be
demonstrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
-c "seek -h 0" file

wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec)
Whence Result
HOLE 139264

Where we can see that hole at offsets 56k..128k has been ignored by the
SEEK_HOLE call.

The underlying problem is in the ext4_find_unwritten_pgoff() which is
just buggy. In some cases it fails to update returned offset when it
finds a hole (when no pages are found or when the first found page has
higher index than expected), in some cases conditions for detecting hole
are just missing (we fail to detect a situation where indices of
returned pages are not contiguous).

Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
indices and also handle all cases where we got less pages then expected
in one place and handle it properly there.

CC: stable@vger.kernel.org
Fixes: c8c0df241cc2719b1262e627f999638411934f60
CC: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# fb26a1cb 12-May-2017 Jan Kara <jack@suse.cz>

ext4: return to starting transaction in ext4_dax_huge_fault()

DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes. To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 99652ea5 31-Mar-2017 David Howells <dhowells@redhat.com>

ext4: Add statx support

Return enhanced file attributes from the Ext4 filesystem. This includes
the following:

(1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.

(2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.

This requires that all ext4 inodes have a getattr call, not just some of
them, so to this end, split the ext4_getattr() function and only call part
of it where appropriate.

Example output:

[root@andromeda ~]# touch foo
[root@andromeda ~]# chattr +ai foo
[root@andromeda ~]# /tmp/test-statx foo
statx(foo) = 0
results=fff
Size: 0 Blocks: 0 IO Block: 4096 regular file
Device: 08:12 Inode: 2101950 Links: 1
Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
Access: 2016-02-11 17:08:29.031795451+0000
Modify: 2016-02-11 17:08:29.031795451+0000
Change: 2016-02-11 17:11:11.987790114+0000
Birth: 2016-02-11 17:08:29.031795451+0000
Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c791ace1 24-Feb-2017 Dave Jiang <dave.jiang@intel.com>

mm: replace FAULT_FLAG_SIZE with parameter to huge_fault

Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has
been somewhat painful with getting the flags set and removed at the
correct locations. More than one kernel oops was introduced due to
difficulties of getting the placement correctly.

Remove the flag values and introduce an input parameter to huge_fault
that indicates the size of the page entry. This makes the code easier
to trace and should avoid the issues we see with the fault flags where
removal of the flag was necessary in the fallback paths.

Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a2d58167 24-Feb-2017 Dave Jiang <dave.jiang@intel.com>

mm,fs,dax: change ->pmd_fault to ->huge_fault

Patch series "1G transparent hugepage support for device dax", v2.

The following series implements support for 1G trasparent hugepage on
x86 for device dax. The bulk of the code was written by Mathew Wilcox a
while back supporting transparent 1G hugepage for fs DAX. I have
forward ported the relevant bits to 4.10-rc. The current submission has
only the necessary code to support device DAX.

Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs. Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file. We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].

[1]: https://lkml.org/lkml/2016/1/31/52

Comments from Nilesh @ Oracle:

There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:

processes : 10,000
memory : 6TB
pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
pmd @ 2M page size: 120,000 / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB

As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level. Memory sizes will keep increasing; so
this number will keep increasing.

An argument can be made to convert the applications from process model
to thread model, but in the real world that may not be always practical.
Hopefully this helps explain the use case where this is valuable.

This patch (of 3):

In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.

[ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
[dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11bac800 24-Feb-2017 Dave Jiang <dave.jiang@intel.com>

mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf

->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4200391 22-Feb-2017 Dave Jiang <dave.jiang@intel.com>

mm, dax: change pmd_fault() to take only vmf parameter

pmd_fault() and related functions really only need the vmf parameter since
the additional parameters are all included in the vmf struct. Remove the
additional parameter and simplify pmd_fault() and friends.

Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8a849e1 22-Feb-2017 Dave Jiang <dave.jiang@intel.com>

mm, dax: make pmd_fault() and friends be the same as fault()

Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.

[dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off]
Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff5462e3 08-Feb-2017 Christoph Hellwig <hch@lst.de>

ext4: fix DAX write locking

Unlike O_DIRECT DAX is not an optional opt-in feature selected by the
application, so we'll have to provide the traditional synchronÑ–zation
of overlapping writes as we do for buffered writes.

This was broken historically for DAX, but got fixed for ext2 and XFS
as part of the iomap conversion. Fix up ext4 as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 0db1ff22 04-Feb-2017 Theodore Ts'o <tytso@mit.edu>

ext4: add shutdown bit and check for it

Add a shutdown bit that will cause ext4 processing to fail immediately
with EIO.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1db17542 21-Oct-2016 Jan Kara <jack@suse.cz>

ext4: Simplify DAX fault path

Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we
can use transaction starting in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# e2ae766c 20-Nov-2016 Jan Kara <jack@suse.cz>

ext4: convert DAX faults to iomap infrastructure

Convert DAX faults to use iomap infrastructure. We would not have to start
transaction in ext4_dax_fault() anymore since ext4_iomap_begin takes
care of that but so far we do that to avoid lock inversion of
transaction start with DAX entry lock which gets acquired in
dax_iomap_fault() before calling ->iomap_begin handler.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 776722e8 20-Nov-2016 Jan Kara <jack@suse.cz>

ext4: DAX iomap write support

Implement DAX writes using the new iomap infrastructure instead of
overloading the direct IO path.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 364443cb 20-Nov-2016 Jan Kara <jack@suse.cz>

ext4: convert DAX reads to iomap infrastructure

Implement basic iomap_begin function that handles reading and use it for
DAX reads.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 213bcd9c 20-Nov-2016 Jan Kara <jack@suse.cz>

ext4: factor out checks from ext4_file_write_iter()

Factor out checks of 'from' and whether we are overwriting out of
ext4_file_write_iter() so that the function is easier to follow.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# fd50ecad 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

vfs: Remove {get,set,remove}xattr inode operations

These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# dbe6ec81 07-Oct-2016 Toshi Kani <toshi.kani@hpe.com>

ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings

To support DAX pmd mappings with unmodified applications, filesystems
need to align an mmap address by the pmd size.

Call thp_get_unmapped_area() from f_op->get_unmapped_area.

Note, there is no change in behavior for a non-DAX file.

Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 51e8137b 29-Sep-2016 Jan Kara <jack@suse.cz>

ext4: remove plugging from ext4_file_write_iter()

do_blockdev_direct_IO() takes care of properly plugging direct IO so
there's no need to plug again inside ext4_file_write_iter().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4b0524aa 29-Sep-2016 Jan Kara <jack@suse.cz>

ext4: allow unlocked direct IO when pages are cached

Currently we do not allow unlocked (meaning without inode_lock) direct
IO when the file has any pages cached. This check is not needed anymore
as we keep inode lock until ext4_direct_IO_write() and thus can happily
writeback and evict any pages conflicting with current direct IO write.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 518eaa63 15-Sep-2016 Fabian Frederick <fabf@skynet.be>

ext4: create EXT4_MAX_BLOCKS() macro

Create a macro to calculate length + offset -> maximum blocks
This adds more readability.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6b524995 26-Jul-2016 Ross Zwisler <zwisler@kernel.org>

dax: remote unused fault wrappers

Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this
removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
dax_pmd_fault() respectively, and update all callers.

The dax_fault() and dax_pmd_fault() wrappers were initially intended to
capture some filesystem independent functionality around page faults
(calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
and ctime).

However, the following commits:

5726b27b09cc ("ext2: Add locking for DAX faults")
ea3d7209ca01 ("ext4: fix races between page faults and hole punching")

added locking to the ext2 and ext4 filesystems after these common
operations but before __dax_fault() and __dax_pmd_fault() were called.
This means that these wrappers are no longer used, and are unlikely to
be used in the future.

XFS has had locking analogous to what was recently added to ext2 and
ext4 since DAX support was initially introduced by:

6b698edeeef0 ("xfs: add DAX file operations support")

Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a7550b30 10-Jul-2016 Jaegeuk Kim <jaegeuk@kernel.org>

ext4 crypto: migrate into vfs's crypto engine

This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 02fbd139 11-May-2016 Jan Kara <jack@suse.cz>

dax: Remove complete_unwritten argument

Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>


# 12735f88 12-May-2016 Jan Kara <jack@suse.cz>

ext4: pre-zero allocated blocks for DAX IO

Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However this way DAX IO can race with page fault to
the same area:

ext4_ext_direct_IO() dax_fault()
dax_io()
get_block() - allocates unwritten extent
copy_from_iter_pmem()
get_block() - converts
unwritten block to
written and zeroes it
out
ext4_convert_unwritten_extents()

So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.

Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e2592217 07-Apr-2016 Christoph Hellwig <hch@lst.de>

fs: simplify the generic_write_sync prototype

The kiocb already has the new position, so use that. The only interesting
case is AIO, where we currently don't bother updating ki_pos. We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# dde0c2e7 07-Apr-2016 Christoph Hellwig <hch@lst.de>

fs: add IOCB_SYNC and IOCB_DSYNC

This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8d2ae1cb 26-Apr-2016 Jakub Wilk <jwilk@jwilk.net>

ext4: remove trailing \n from ext4_warning/ext4_error calls

Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c0a37d487 26-Mar-2016 Miklos Szeredi <mszeredi@redhat.com>

ext4: use file_dentry()

EXT4 may be used as lower layer of overlayfs and accessing f_path.dentry
can lead to a crash.

Fix by replacing direct access of file->f_path.dentry with the
file_dentry() accessor, which will always return a native object.

Reported-by: Daniel Axtens <dja@axtens.net>
Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Fixes: ff978b09f973 ("ext4 crypto: move context consistency check to ext4_file_open()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> # v4.5


# 9dd78d8c 26-Mar-2016 Miklos Szeredi <mszeredi@redhat.com>

ext4: use dget_parent() in ext4_file_open()

In f_op->open() lock on parent is not held, so there's no guarantee that
parent dentry won't go away at any time.

Even after this patch there's no guarantee that 'dir' will stay the parent
of 'inode', but at least it won't be freed while being used.

Fixes: ff978b09f973 ("ext4 crypto: move context consistency check to ext4_file_open()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org> # v4.5


# 2d90c160 09-Mar-2016 Jan Kara <jack@suse.cz>

ext4: more efficient SEEK_DATA implementation

Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as
ext4_seek_data() iterates hole block-by-block. Fix the problem by using
returned hole size from ext4_map_blocks() and thus skip the hole in one
go.

Update also SEEK_HOLE implementation to follow the same pattern as
SEEK_DATA to make future maintenance easier.

Furthermore we add cond_resched() to both ext4_seek_data() and
ext4_seek_hole() to avoid softlockups in case evil user creates huge
fragmented file and we have to go through lots of extents.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e142d052 08-Mar-2016 Jan Kara <jack@suse.cz>

ext4: use i_mutex to serialize unaligned AIO DIO

Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
However the code cleanups that happened after 2011 when the lock was
introduced made aio_mutex acquired at almost the same places where we
already have exclusion using i_mutex. So just use i_mutex for the
exclusion of unaligned AIO DIO.

The change moves waiting for pending unwritten extent conversion under
i_mutex. That makes special handling of O_APPEND writes unnecessary and
also avoids possible livelocking of unaligned AIO DIO with aligned one
(nothing was preventing contiguous stream of aligned AIO DIOs to let
unaligned AIO DIO wait forever).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1e9d180b 27-Feb-2016 Ross Zwisler <zwisler@kernel.org>

ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()

As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry. For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect. The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.

Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().

Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ff978b09 07-Feb-2016 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: move context consistency check to ext4_file_open()

In the case where the per-file key for the directory is cached, but
root does not have access to the key needed to derive the per-file key
for the files in the directory, we allow the lookup to succeed, so
that lstat(2) and unlink(2) can suceed. However, if a program tries
to open the file, it will get an ENOKEY error.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d5be7a03 22-Jan-2016 Ross Zwisler <zwisler@kernel.org>

ext4: call dax_pfn_mkwrite() for DAX fsync/msync

To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ba5843f5 07-Dec-2015 Jan Kara <jack@suse.com>

ext4: use pre-zeroed blocks for DAX page faults

Make DAX fault path use pre-zeroed blocks to avoid races with extent
conversion and zeroing when two page faults to the same block happen.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ea3d7209 07-Dec-2015 Jan Kara <jack@suse.com>

ext4: fix races between page faults and hole punching

Currently, page faults and hole punching are completely unsynchronized.
This can result in page fault faulting in a page into a range that we
are punching after truncate_pagecache_range() has been called and thus
we can end up with a page mapped to disk blocks that will be shortly
freed. Filesystem corruption will shortly follow. Note that the same
race is avoided for truncate by checking page fault offset against
i_size but there isn't similar mechanism available for punching holes.

Fix the problem by creating new rw semaphore i_mmap_sem in inode and
grab it for writing over truncate, hole punching, and other functions
removing blocks from extent tree and for read over page faults. We
cannot easily use i_data_sem for this since that ranks below transaction
start and we need something ranking above it so that it can be held over
the whole truncate / hole punching operation. Also remove various
workarounds we had in the code to reduce race window when page fault
could have created pages with stale mapping information.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 01a33b4a 08-Sep-2015 Matthew Wilcox <willy@infradead.org>

ext4: start transaction before calling into DAX

Jan Kara pointed out that in the case where we are writing to a hole, we
can end up with a lock inversion between the page lock and the journal
lock. We can avoid this by starting the transaction in ext4 before
calling into DAX. The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault, like
XFS does.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed923b57 08-Sep-2015 Matthew Wilcox <willy@infradead.org>

ext4: add ext4_get_block_dax()

DAX wants different semantics from any currently-existing ext4 get_block
callback. Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents. So introduce a new ext4_get_block_dax() which
has those semantics.

We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e676a4c1 08-Sep-2015 Matthew Wilcox <willy@infradead.org>

ext4: use ext4_get_block_write() for DAX

DAX relies on the get_block function either zeroing newly allocated
blocks before they're findable by subsequent calls to get_block, or
marking newly allocated blocks as unwritten. ext4_get_block() cannot
create unwritten extents, but ext4_get_block_write() can.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Andy Rudoff <andy.rudoff@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11bd1a9e 08-Sep-2015 Matthew Wilcox <willy@infradead.org>

ext4: huge page fault support

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c94c2acf 08-Sep-2015 Matthew Wilcox <willy@infradead.org>

dax: move DAX-related functions to a new header

In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h. Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>. We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e842f290 03-Jun-2015 Dave Chinner <dchinner@redhat.com>

dax: don't abuse get_block mapping for endio callbacks

dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.

Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.

Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all it's reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# abdd438b 31-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: handle unexpected lack of encryption keys

Fix up attempts by users to try to write to a file when they don't
have access to the encryption key.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b7236e21 18-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: reorganize how we store keys in the inode

This is a pretty massive patch which does a number of different things:

1) The per-inode encryption information is now stored in an allocated
data structure, ext4_crypt_info, instead of directly in the node.
This reduces the size usage of an in-memory inode when it is not
using encryption.

2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
encryption structure instead. This remove an unnecessary memory
allocation and free for the fname_crypto_ctx as well as allowing us
to reuse the ctfm in a directory for multiple lookups and file
creations.

3) We also cache the inode's policy information in the ext4_crypt_info
structure so we don't have to continually read it out of the
extended attributes.

4) We now keep the keyring key in the inode's encryption structure
instead of releasing it after we are done using it to derive the
per-inode key. This allows us to test to see if the key has been
revoked; if it has, we prevent the use of the derived key and free
it.

5) When an inode is released (or when the derived key is freed), we
will use memset_explicit() to zero out the derived key, so it's not
left hanging around in memory. This implies that when a user logs
out, it is important to first revoke the key, and then unlink it,
and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
release any decrypted pages and dcache entries from the system
caches.

6) All this, and we also shrink the number of lines of code by around
100. :-)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# be64f884 15-Apr-2015 Boaz Harrosh <boaz@plexistor.com>

dax: unify ext2/4_{dax,}_file_operations

The original dax patchset split the ext2/4_file_operations because of the
two NULL splice_read/splice_write in the dax case.

In the vfs if splice_read/splice_write are NULL we then call
default_splice_read/write.

What we do here is make generic_file_splice_read aware of IS_DAX() so the
original ext2/4_file_operations can be used as is.

For write it appears that iter_file_splice_write is just fine. It uses
the regular f_op->write(file,..) or new_sync_write(file, ...).

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0e3b210c 15-Apr-2015 Boaz Harrosh <boaz@plexistor.com>

dax: use pfn_mkwrite to update c/mtime + freeze protection

From: Yigal Korman <yigal@plexistor.com>

[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.

A new xfstest is submitted for testing this (generic/080)

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c9c7429c 11-Apr-2015 Michael Halcrow <mhalcrow@google.com>

ext4 crypto: implement the ext4 decryption read path

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2ba48ce5 09-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

mirror O_APPEND and O_DIRECT into iocb->ki_flags

... avoiding write_iter/fcntl races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3309dd04 08-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

switch generic_write_checks() to iocb and iter

... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success. Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Conflicts:
fs/ext4/file.c


# e768d7ff 07-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

ext4_file_write_iter: move generic_write_checks() up

simpler that way...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0fa6b005 04-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

generic_write_checks(): drop isblk argument

all remaining callers are passing 0; some just obscure that fact.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5f380c7f 07-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

lift generic_write_checks() into callers of __generic_file_write_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5d5d5689 03-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

make new_sync_{read,write}() static

All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 72b8e0f9 02-Apr-2015 Sheng Yong <shengyong1@huawei.com>

ext4: remove unused header files

Remove unused header files and header files which are included in
ext4.h.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e2e40f2c 22-Feb-2015 Christoph Hellwig <hch@lst.de>

fs: move struct kiocb to fs.h

struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 923ae0ff 16-Feb-2015 Ross Zwisler <zwisler@kernel.org>

ext4: add DAX functionality

This is a port of the DAX functionality found in the current version of
ext2.

[matthew.r.wilcox@intel.com: heavily tweaked]
[akpm@linux-foundation.org: remap_pages went away]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d83a08db 10-Feb-2015 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub

Nobody uses it anymore.

[akpm@linux-foundation.org: fix filemap_xip.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad7fefb1 02-Jan-2015 Theodore Ts'o <tytso@mit.edu>

Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"

This reverts commit 14516bb7bb6ffbd49f35389f9ece3b2045ba5815.

This was causing regression test failures with generic/285 with an ext3
filesystem using CONFIG_EXT4_USE_FOR_EXT23.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 14516bb7 02-Dec-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: fix suboptimal seek_{data,hole} extents traversial

It is ridiculous practice to scan inode block by block, this technique
applicable only for old indirect files. This takes significant amount
of time for really large files. Let's reuse ext4_fiemap which already
traverse inode-tree in most optimal meaner.

TESTCASE:
ftruncate64(fd, 0);
ftruncate64(fd, 1ULL << 40);
/* lseek will spin very long time */
lseek64(fd, 0, SEEK_DATA);
lseek64(fd, 0, SEEK_HOLE);

Original report: https://lkml.org/lkml/2014/10/16/620

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a41537e6 30-Oct-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: prevent bugon on race between write/fcntl

O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked
twice inside ext4_file_write_iter() and __generic_file_write() which
result in BUG_ON inside ext4_direct_IO.

Let's initialize iocb->private unconditionally.

TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/

#TYPICAL STACK TRACE:
kernel BUG at fs/ext4/inode.c:2960!
invalid opcode: 0000 [#1] SMP
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000
RIP: 0010:[<ffffffff811fabf2>] [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246
RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818
RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001
RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80
R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28
FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0
Stack:
ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30
0000000000000200 0000000000000200 0000000000000001 0000000000000200
ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200
Call Trace:
[<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180
[<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370
[<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400
[<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
[<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
[<ffffffff810990e5>] ? local_clock+0x25/0x30
[<ffffffff810abd94>] ? __lock_acquire+0x274/0x700
[<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0
[<ffffffff811bd756>] aio_run_iocb+0x286/0x410
[<ffffffff810990e5>] ? local_clock+0x25/0x30
[<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
[<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0
[<ffffffff811bde3b>] do_io_submit+0x55b/0x740
[<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740
[<ffffffff811be030>] SyS_io_submit+0x10/0x20
[<ffffffff815ce192>] system_call_fastpath+0x16/0x1b
Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c
24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89
RIP [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
RSP <ffff88080f90bb58>

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org


# 590a1418 15-Jul-2014 Zhang Zhen <zhenzhang.zhang@huawei.com>

ext4: remove readpage() check in ext4_mmap_file()

There is no kind of file which does not supply a page reading function.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8d020765 05-Apr-2014 Al Viro <viro@zeniv.linux.org.uk>

->splice_write() via ->write_iter()

iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter(). A bunch of simple cases coverted to that...

[AV: fixed the braino spotted by Cyrill]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5d601255 12-May-2014 liang xie <xieliang007@gmail.com>

ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access

Make them more consistently

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c197855e 12-May-2014 Stephen Hemminger <stephen@networkplumber.org>

ext4: make local functions static

I have been running make namespacecheck to look for unneeded globals, and
found these in ext4.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9b884164 17-Apr-2014 Al Viro <viro@zeniv.linux.org.uk>

convert ext4 to ->write_iter()

unfortunately, Ted's changes to ext4_file_write() are *still* an
incomplete fix - playing with rlimits can let you smuggle an
unaligned request past the checks. So there almost certainly
will be more merge PITA around that place...

[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# aad4f8bb 02-Apr-2014 Al Viro <viro@zeniv.linux.org.uk>

switch simple generic_file_aio_read() users to ->read_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f5ccfe1d 21-Apr-2014 Theodore Ts'o <tytso@mit.edu>

ext4: fix locking for O_APPEND writes

Al Viro pointed out that locking for O_APPEND writes was problematic,
since the location of the write isn't known until after we take the
i_mutex, which impacts the ext4_unaligned_aio() and s_bitmap_maxbytes
check.

For O_APPEND always assume that the write is unaligned so call
ext4_unwritten_wait(). And to solve the second problem, take the
i_mutex earlier before we start the s_bitmap_maxbytes check.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7ed07ba8 21-Apr-2014 Theodore Ts'o <tytso@mit.edu>

ext4: factor out common code in ext4_file_write()

This shouldn't change any logic flow; just delete duplicated code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 8ad2850f 21-Apr-2014 Theodore Ts'o <tytso@mit.edu>

ext4: move ext4_file_dio_write() into ext4_file_write()

This commit doesn't actually change anything; it just moves code
around in preparation for some code simplification work.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 7608e610 21-Apr-2014 Theodore Ts'o <tytso@mit.edu>

ext4: inline generic_file_aio_write() into ext4_file_write()

Copy generic_file_aio_write() into ext4_file_write(). This is part of
a patch series which allows us to simplify ext4_file_write() and
ext4_file_dio_write(), by calling __generic_file_aio_write() directly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 556615dc 20-Apr-2014 Lukas Czerner <lczerner@redhat.com>

ext4: rename uninitialized extents to unwritten

Currently in ext4 there is quite a mess when it comes to naming
unwritten extents. Sometimes we call it uninitialized and sometimes we
refer to it as unwritten.

The right name for the extent which has been allocated but does not
contain any written data is _unwritten_. Other file systems are
using this name consistently, even the buffer head state refers to it as
unwritten. We need to fix this confusion in ext4.

This commit changes every reference to an uninitialized extent (meaning
allocated but unwritten) to unwritten extent. This includes comments,
function names and variable names. It even covers abbreviation of the
word uninitialized (such as uninit) and some misspellings.

This commit does not change any of the code paths at all. This has been
confirmed by comparing md5sums of the assembly code of each object file
after all the function names were stripped from it.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6e6358fc 11-Apr-2014 Theodore Ts'o <tytso@mit.edu>

ext4: use i_size_read in ext4_unaligned_aio()

We haven't taken i_mutex yet, so we need to use i_size_read().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org


# f1820361 07-Apr-2014 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm: implement ->map_pages for page cache

filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 45d4f855 03-Apr-2014 Dan Carpenter <dan.carpenter@oracle.com>

fs/direct-io.c: remove some left over checks

We know that "ret > 0" is true here. These tests were left over from
commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't
needed any more.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41fc56d5 08-Feb-2014 Al Viro <viro@zeniv.linux.org.uk>

kill the 4th argument of __generic_file_aio_write()

It's always equal to &iocb->ki_pos, where iocb is the value of the 1st
argument.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d311d79d 09-Feb-2014 Al Viro <viro@zeniv.linux.org.uk>

fix O_SYNC|O_APPEND syncing the wrong range on write()

It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
when sync_page_range() had been introduced; generic_file_write{,v}() correctly
synced
pos_after_write - written .. pos_after_write - 1
but generic_file_aio_write() synced
pos_before_write .. pos_before_write + written - 1
instead. Which is not the same thing with O_APPEND, obviously.
A couple of years later correct variant had been killed off when
everything switched to use of generic_file_aio_write().

All users of generic_file_aio_write() are affected, and the same bug
has been copied into other instances of ->aio_write().

The fix is trivial; the only subtle point is that generic_write_sync()
ought to be inlined to avoid calculations useless for the majority of
calls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 64e178a7 20-Dec-2013 Christoph Hellwig <hch@infradead.org>

ext2/3/4: use generic posix ACL infrastructure

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 02afc27f 04-Sep-2013 Christoph Hellwig <hch@infradead.org>

direct-io: Handle O_(D)SYNC AIO

Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.

Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a361293f 16-Aug-2013 Jan Kara <jack@suse.cz>

jbd2: Fix oops in jbd2_journal_file_inode()

Commit 0713ed0cde76438d05849f1537d3aab46e099475 added
jbd2_journal_file_inode() call into ext4_block_zero_page_range().
However that function gets called from truncate path and thus inode
needn't have jinode attached - that happens in ext4_file_open() but
the file needn't be ever open since mount. Calling
jbd2_journal_file_inode() without jinode attached results in the oops.

We fix the problem by attaching jinode to inode also in ext4_truncate()
and ext4_punch_hole() when we are going to zero out partial blocks.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 46a1c2c7 24-Jun-2013 Jie Liu <jeff.liu@oracle.com>

vfs: export lseek_execute() to modules

For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().

To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.

Thanks Dave Chinner for this suggestion.

[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e7293fd1 31-May-2013 Jan Kara <jack@suse.cz>

ext4: fix overflows in SEEK_HOLE, SEEK_DATA implementations

ext4_lblk_t is just u32 so multiplying it by blocksize can easily
overflow for files larger than 4 GB. Fix that by properly typing the
block offsets before shifting.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>


# a27bb332 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: don't include aio.h in sched.h

Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e30b5dca 03-May-2013 Yan, Zheng <zheng.z.yan@intel.com>

ext4: fix fio regression

We (Linux Kernel Performance project) found a regression introduced
by commit:

f7fec032aa ext4: track all extent status in extent status tree

The commit causes about 20% performance decrease in fio random write
test. Profiler shows that rb_next() uses a lot of CPU time. The call
stack is:

rb_next
ext4_es_find_delayed_extent
ext4_map_blocks
_ext4_get_block
ext4_get_block_write
__blockdev_direct_IO
ext4_direct_IO
generic_file_direct_write
__generic_file_aio_write
ext4_file_write
aio_rw_vect_retry
aio_run_iocb
do_io_submit
sys_io_submit
system_call_fastpath
io_submit
td_io_getevents
io_u_queued_complete
thread_main
main
__libc_start_main

The cause is that ext4_es_find_delayed_extent() doesn't have an
upper bound, it keeps searching until a delayed extent is found.
When there are a lots of non-delayed entries in the extent state
tree, ext4_es_find_delayed_extent() may uses a lot of CPU time.

Reported-by: LKP project <lkp@linux.intel.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>


# 496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# be401363 17-Feb-2013 Zheng Liu <wenqing.lz@taobao.com>

ext4: rename and improbe ext4_es_find_extent()

This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent
and improve this function. First, we split input and output parameter.
Second, this function never return the first block of the next delayed
extent after 'es'.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>


# 06b0c886 17-Feb-2013 Zheng Liu <wenqing.lz@taobao.com>

ext4: refine extent status tree

This commit refines the extent status tree code.

1) A prefix 'es_' is added to to the extent status tree structure
members.

2) Refactored es_remove_extent() so that __es_remove_extent() can be
used by es_insert_extent() to remove the old extent entry(-ies) before
inserting a new one.

3) Rename extent_status_end() to ext4_es_end()

4) ext4_es_can_be_merged() is define to check whether two extents can
be merged or not.

5) Update and clarified comments.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 9924a92a 08-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: pass context information to jbd2__journal_start()

So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

T=/sys/kernel/debug/tracing
EVENT=$T/events/jbd2/jbd2_handle_stats
echo "interval > 5" > $EVENT/filter
echo 1 > $EVENT/enable

./run-my-fs-benchmark

cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms. Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation. Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a28a9178 25-Dec-2012 Eric Sandeen <sandeen@redhat.com>

ext4: remove unaligned AIO warning printk

Although I put this in, I now think it was a bad decision. For most
users, there is very little to be done in this case. They get the
message, once per day, with no real context or proposed action. TBH,
it generates support calls when it probably does not need to; the
message sounds more dire than the situation really is.

Just nuke it. Normal investigation via blktrace or whatnot can
reveal poor IO patterns if bad performance is encountered.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 965c8e59 17-Dec-2012 Andrew Morton <akpm@linux-foundation.org>

lseek: the "whence" argument is called "whence"

But the kernel decided to call it "origin" instead. Fix most of the
sites.

Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 939da108 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: Remove CONFIG_EXT4_FS_XATTR

Ted has sent out a RFC about removing this feature. Eric and Jan
confirmed that both RedHat and SUSE enable this feature in all their
product. David also said that "As far as I know, it's enabled in all
Android kernels that use ext4." So it seems OK for us.

And what's more, as inline data depends its implementation on xattr,
and to be frank, I don't run any test again inline data enabled while
xattr disabled. So I think we should add inline data and remove this
config option in the same release.

[ The savings if you disable CONFIG_EXT4_FS_XATTR is only 27k, which
isn't much in the grand scheme of things. Since no one seems to be
testing this configuration except for some automated compile farms, on
balance we are better removing this config option, and so that it is
effectively always enabled. -- tytso ]

Cc: David Brown <davidb@codeaurora.org>
Cc: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c8c0df24 08-Nov-2012 Zheng Liu <wenqing.lz@taobao.com>

ext4: introduce lseek SEEK_DATA/SEEK_HOLE support

This patch makes ext4 really support SEEK_DATA/SEEK_HOLE flags. Block-mapped
and extent-mapped files are fully implemented together because ext4_map_blocks
hides this differences.

After applying this patch, it will cause a failure in xfstest #285 when the file
is block-mapped due to block-mapped file isn't support fallocate(2).

I had tried to use ext4_ext_walk_space() to retrieve the offset for a
extent-mapped file. But finally I decide to keep using ext4_map_blocks() to
support SEEK_DATA/SEEK_HOLE because ext4_map_blocks() can hide the difference
between block-mapped file and extent-mapped file. Moreover, in next step,
extent status tree will track all extent status, and we can get all mappings
from this tree. So I think that using ext4_map_blocks() is a better choice.

CC: Hugh Dickins <hughd@google.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0b173bc4 08-Oct-2012 Konstantin Khlebnikov <khlebnikov@openvz.org>

mm: kill vma flag VM_CAN_NONLINEAR

Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().

Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.

Now device drivers can implement this method and obtain nonlinear vma support.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c278531d 05-Oct-2012 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: fix ext4_flush_completed_IO wait semantics

BUG #1) All places where we call ext4_flush_completed_IO are broken
because buffered io and DIO/AIO goes through three stages
1) submitted io,
2) completed io (in i_completed_io_list) conversion pended
3) finished io (conversion done)
And by calling ext4_flush_completed_IO we will flush only
requests which were in (2) stage, which is wrong because:
1) punch_hole and truncate _must_ wait for all outstanding unwritten io
regardless to it's state.
2) fsync and nolock_dio_read should also wait because there is
a time window between end_page_writeback() and ext4_add_complete_io()
As result integrity fsync is broken in case of buffered write
to fallocated region:
fsync blkdev_completion
->filemap_write_and_wait_range
->ext4_end_bio
->end_page_writeback
<-- filemap_write_and_wait_range return
->ext4_flush_completed_IO
sees empty i_completed_io_list but pended
conversion still exist
->ext4_add_complete_io

BUG #2) Race window becomes wider due to the 'ext4: completed_io
locking cleanup V4' patch series

This patch make following changes:
1) ext4_flush_completed_io() now first try to flush completed io and when
wait for any outstanding unwritten io via ext4_unwritten_wait()
2) Rename function to more appropriate name.
3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
prevent endless wait

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# e27f41e1 28-Sep-2012 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: give i_aiodio_unwritten a more appropriate name

AIO/DIO prefix is wrong because it account unwritten extents which
also may be scheduled from buffered write endio

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 044ce47f 22-Jul-2012 Jan Kara <jack@suse.cz>

ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super()

The last user of ext4_mark_super_dirty() in ext4_file_open() is so
rare it can well be modifying the superblock properly by journalling
the change. Change it and get rid of ext4_mark_super_dirty() as it's
not needed anymore.

Artem: small amendments.
Artem: tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>


# 4bd809db 22-Jul-2012 Zheng Liu <wenqing.lz@taobao.com>

ext4: don't take the i_mutex lock when doing DIO overwrites

Aligned and overwrite direct I/O can be parallelized. In
ext4_file_dio_write, we first check whether these conditions are
satisfied or not. If so, we take i_data_sem and release i_mutex lock
directly. Meanwhile iocb->private is set to indicate that this is a
dio overwrite, and it will be handled in ext4_ext_direct_IO.

[ Added fix from Dan Carpenter to fix locking bug on the error path. ]

CC: Tao Ma <tm@tao.ma>
CC: Eric Sandeen <sandeen@redhat.com>
CC: Robin Dong <hao.bigrat@gmail.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>


# ec7268ce 30-Apr-2012 Eric Sandeen <sandeen@redhat.com>

ext4: use core vfs llseek code for dir seeks

Use the new functionality in generic_file_llseek_size() to
accept a custom EOF position, and un-cut-and-paste all the
vfs llseek code from ext4.

Also fix up comments on ext4_llseek() to reflect reality.

Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e8b96eb5 30-Apr-2012 Eric Sandeen <sandeen@redhat.com>

vfs: allow custom EOF in generic_file_llseek code

For ext3/4 htree directories, using the vfs llseek function with
SEEK_END goes to i_size like for any other file, but in reality
we want the maximum possible hash value. Recent changes
in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
but replicating this core code seems like a bad idea, especially
since the copy has already diverged from the vfs.

This patch updates generic_file_llseek_size to accept
both a custom maximum offset, and a custom EOF position. With this
in place, ext4_dir_llseek can pass in the appropriate maximum hash
position for both maxsize and eof, and get what it wants.

As far as I know, this does not fix any bugs - nfs in the kernel
doesn't use SEEK_END, and I don't know of any user who does. But
some ext4 folks seem keen on doing the right thing here, and I can't
really argue.

(Patch also fixes up some comments slightly)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fbe10494 09-Jul-2012 Zheng Liu <wenqing.lz@taobao.com>

ext4: split ext4_file_write into buffered IO and direct IO

ext4_file_dio_write is defined in order to split buffered IO and
direct IO in ext4. This patch just refactor some stuff in write path.

CC: Tao Ma <tm@tao.ma>
CC: Eric Sandeen <sandeen@redhat.com>
CC: Robin Dong <hao.bigrat@gmail.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 8563000d 28-May-2012 Zheng Liu <gnehzuil.liu@gmail.com>

ext4: use consistent ssize_t type in ext4_file_write()

The generic_file_aio_write() function returns ssize_t, and
ext4_file_write() returns a ssize_t, so use a ssize_t to collect the
return value from generic_file_aio_write(). It shouldn't matter since
the VFS read/write paths shouldn't allow a read greater than MAX_INT,
but there was previously a bug in the AIO code paths, and it's best if
we use a consistent type so that the return value from
generic_file_aio_write() can't get truncated.

Reported-by: Jouni Siren <jouni.siren@iki.fi>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4cce0e28 15-Sep-2011 Andi Kleen <ak@linux.intel.com>

ext4: replace cut'n'pasted llseek code with generic_file_llseek_size

This gives ext4 the benefits of unlocked llseek.

Cc: tytso@mit.edu
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# cf803903 25-Oct-2011 Darrick J. Wong <djwong@us.ibm.com>

ext4: prevent stack overrun in ext4_file_open

In ext4_file_open, the filesystem records the mountpoint of the first
file that is opened after mounting the filesystem. It does this by
allocating a 64-byte stack buffer, calling d_path() to grab the mount
point through which this file was accessed, and then memcpy()ing 64
bytes into the superblock's s_last_mounted field, starting from the
return value of d_path(), which is stored as "cp". However, if cp >
buf (which it frequently is since path components are prepended
starting at the end of buf) then we can end up copying stack data into
the superblock.

Writing stack variables into the superblock doesn't sound like a great
idea, so use strlcpy instead. Andi Kleen suggested using strlcpy
instead of strncpy.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4e34e719 23-Jul-2011 Christoph Hellwig <hch@lst.de>

fs: take the ACL checks to common code

Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c334b113 18-Jul-2011 Josef Bacik <josef@redhat.com>

Ext4: handle SEEK_HOLE/SEEK_DATA generically

Since Ext4 has its own lseek we need to make sure it handles
SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic
case, somebody else can come along and make it do fancy things later. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ae24f28d 25-May-2011 Jan Kara <jack@suse.cz>

ext4: Convert ext4 to new truncate calling convention


Trivial conversion. Fixup one error handling case calling vmtruncate()
and remove ->truncate callback. We also fix a bug that IS_IMMUTABLE and
IS_APPEND files could not be truncated during failed writes. In fact, the
test can be completely removed as upper layers do necessary permission
checks for truncate in do_sys_[f]truncate() and may_open() anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e9e3bcec 12-Feb-2011 Eric Sandeen <sandeen@redhat.com>

ext4: serialize unaligned asynchronous DIO

ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead
of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global
variables are protected with an "ext4_" prefix.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2fe17c10 14-Jan-2011 Christoph Hellwig <hch@lst.de>

fallocate should be a file operation

Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes. On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions. Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.

This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8aefcd55 09-Jan-2011 Theodore Ts'o <tytso@mit.edu>

ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary

Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing. This allows us to further slim down the ext4_inode_info
structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e0d10bfa 27-Oct-2010 Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

ext4: improve llseek error handling for overly large seek offsets

The llseek system call should return EINVAL if passed a seek offset
which results in a write error. What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based or not.


If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be
written (write systemcall) is different from the maximum size which can be
sought (lseek systemcall).

For example, the following 2 cases demonstrates the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:

#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>

Table. the max file size which we can write or seek
at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag| | |
| \ | !EXT4_EXTENTS_FL | EXT4_EXTETNS_FL |
|case \| | |
+------------+-------------------------------+-------------------------------+
| #1 | write: 2194719883264 | write: -------------- |
| | seek: 2199023251456 | seek: -------------- |
+------------+-------------------------------+-------------------------------+
| #2 | write: 4402345721856 | write: 17592186044415 |
| | seek: 17592186044415 | seek: 17592186044415 |
+------------+-------------------------------+-------------------------------+

The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped
maxbytes). Although generic_file_llseek uses only extent-mapped maxbytes.
(llseek of ext4_file_operations is generic_file_llseek which uses
sb->s_maxbytes.)

Therefore we create ext4 llseek function which uses 2 maxbytes.

The new own function originates from generic_file_llseek().
If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters
inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>


# d889dc83 27-Jul-2010 Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

ext4: fix EFBIG edge case when writing to large non-extent file

By running the following reproducer, we can confirm that the write
system call returns with 0 when it should return the error EFBIG.

#!/bin/sh

/bin/dd if=/dev/zero of=./img bs=1k count=1 seek=1024k > /dev/null 2>&1
/sbin/mkfs.ext3 -Fq ./img
/bin/mount -o loop -t ext4 ./img /mnt
/bin/touch /mnt/file
strace /bin/dd if=/dev/zero of=/mnt/file conv=notrunc bs=1k count=1 seek=$((2194719883264/1024)) 2>&1 | /bin/egrep "write.* 1024\) = "
/bin/umount /mnt
exit

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>


# a0375156 11-Jun-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Clean up s_dirt handling

We don't need to set s_dirt in most of the ext4 code when journaling
is enabled. In ext3/4 some of the summary statistics for # of free
inodes, blocks, and directories are calculated from the per-block
group statistics when the file system is mounted or unmounted. As a
result the superblock doesn't have to be updated, either via the
journal or by setting s_dirt. There are a few exceptions, most
notably when resizing the file system, where the superblock needs to
be modified --- and in that case it should be done as a journalled
operation if possible, and s_dirt set only in no-journal mode.

This patch will optimize out some unneeded disk writes when using ext4
with a journal.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 12e9b892 16-May-2010 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: Use bitops to read/modify i_flags in struct ext4_inode_info

At several places we modify EXT4_I(inode)->i_flags without holding
i_mutex (ext4_do_update_inode, ...). These modifications are racy and
we can lose updates to i_flags. So convert handling of i_flags to use
bitops which are atomic.

https://bugzilla.kernel.org/show_bug.cgi?id=15792

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 871a2931 03-Mar-2010 Christoph Hellwig <hch@infradead.org>

dquot: cleanup dquot initialize routine

Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 907f4554 03-Mar-2010 Christoph Hellwig <hch@infradead.org>

dquot: move dquot initialization responsibility into the filesystem

Currently various places in the VFS call vfs_dq_init directly. This means
we tie the quota code into the VFS. Get rid of that and make the
filesystem responsible for the initialization. For most metadata operations
this is a straight forward move into the methods, but for truncate and
open it's a bit more complicated.

For truncate we currently only call vfs_dq_init for the sys_truncate case
because open already takes care of it for ftruncate and open(O_TRUNC) - the
new code causes an additional vfs_dq_init for those which is harmless.

For open the initialization is moved from do_filp_open into the open method,
which means it happens slightly earlier now, and only for regular files.
The latter is fine because we don't need to initialize it for operations
on special files, and we already do it as part of the namespace operations
for directories.

Add a dquot_file_open helper that filesystems that support generic quotas
can use to fill in ->open.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 3899167d 23-Jan-2010 Al Viro <viro@zeniv.linux.org.uk>

Get rid of mnt_mountpoint abuses in ext4

path to mnt/mnt->mnt_root is no worse than that to
mnt->mnt_parent/mnt->mnt_mountpoint *and* needs no
pinning the sucker down (mnt is not going away and
mnt->mnt_root won't change)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 19f5fb7a 24-Jan-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Use bitops to read/modify EXT4_I(inode)->i_state

At several places we modify EXT4_I(inode)->i_state without holding
i_mutex (ext4_release_file, ext4_bmap, ext4_journalled_writepage,
ext4_do_update_inode, ...). These modifications are racy and we can
lose updates to i_state. So convert handling of i_state to use bitops
which are atomic.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f0f37e2f 27-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

const: mark struct vm_struct_operations

* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code

But leave TTM code alone, something is fishy there with global vm_ops
being used.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d34ec62 18-Aug-2009 Jan Kara <jack@suse.cz>

ext4: Remove syncing logic from ext4_file_write

The syncing is now properly handled by generic_file_aio_write() so
no special ext4 code is needed.

CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>


# 1d5ccd1c 28-Aug-2009 Linus Torvalds <torvalds@linux-foundation.org>

ext[234]: move over to 'check_acl' permission model

Don't implement per-filesystem 'extX_permission()' functions that have
to be called for every path component operation, and instead just expose
the actual ACL checking so that the VFS layer can now do it for us.

Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bc0b0d6d 13-Jun-2009 Theodore Ts'o <tytso@mit.edu>

ext4: update the s_last_mounted field in the superblock

This field can be very helpful when a system administrator is trying
to sort through large numbers of block devices or filesystem images.
What is stored in this field can be ambiguous if multiple filesystem
namespaces are in play; what we store in practice is the mountpoint
interpreted by the process's namespace which first opens a file in the
filesystem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# d6014301 27-Mar-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix discard of inode prealloc space with delayed allocation.

With delayed allocation we should not/cannot discard inode prealloc
space during file close. We would still have dirty pages for which we
haven't allocated blocks yet. With this fix after each get_blocks
request we check whether we have zero reserved blocks and if yes and
we don't have any writers on the file we discard inode prealloc space.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7d8f9f7d 24-Feb-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Automatically allocate delay allocated blocks on close

When closing a file that had been previously truncated, force any
delay allocated blocks that to be allocated so that if the filesystem
is mounted with data=ordered, the data blocks will be pushed out to
disk along with the journal commit. Many application programs expect
this, so we do this to avoid zero length files if the system crashes
unexpectedly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3a06d778 22-Nov-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: sparse fixes

* Change EXT4_HAS_*_FEATURE to return a boolean
* Add a function prototype for ext4_fiemap() in ext4.h
* Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions
* Add lock annotations to mb_free_blocks()

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 03010a33 10-Oct-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Rename ext4dev to ext4

The ext4 filesystem is getting stable enough that it's time to drop
the "dev" prefix. Also remove the requirement for the TEST_FILESYS
flag.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6873fa0d 06-Oct-2008 Eric Sandeen <sandeen@redhat.com>

Hook ext4 to the vfs fiemap interface.

ext4_ext_walk_space() was reinstated to be used for iterating over file
extents with a callback; it is used by the ext4 fiemap implementation.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org


# c2ea3fde 10-Oct-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Remove old legacy block allocator

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# af5bc92d 08-Sep-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Fix whitespace checkpatch warnings/errors

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3e3398a0 11-Jul-2008 Mingming Cao <cmm@us.ibm.com>

ext4: delayed allocation i_blocks fix for stat

Right now i_blocks is not getting updated until the blocks are actually
allocaed on disk. This means with delayed allocation, right after files
are copied, "ls -sF" shoes the file as taking 0 blocks on disk. "du"
also shows the files taking zero space, which is highly confusing to the
user.

Since delayed allocation already keeps track of per-inode total
number of blocks that are subject to delayed allocation, this patch fix
this by using that to adjust the value returned by stat(2). When real
block allocation is done, the i_blocks will get updated. Since the
reserved blocks for delayed allocation will be decreased, this will be
keep value returned by stat(2) consistent.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2e9ee850 11-Jul-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Use page_mkwrite vma_operations to get mmap write notification.

We would like to get notified when we are doing a write on mmap section.
This is needed with respect to preallocated area. We split the preallocated
area into initialzed extent and uninitialzed extent in the call back. This
let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and
that would result in data loss. The changes are also needed to handle ENOSPC
when writing to an mmap section of files with holes.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3dcf5451 29-Apr-2008 Christoph Hellwig <hch@lst.de>

ext4: move headers out of include/linux

Move ext4 headers out of include/linux. This is just the trivial move,
there's some more thing that could be done later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5cdd7b2d 29-Apr-2008 Andi Kleen <ak@linux.intel.com>

Convert ext4 to use unlocked_ioctl

I checked ext4_ioctl and it looked largely safe to not be used
without BKL. So convert it over to unlocked_ioctl.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 0e855ac8 28-Jan-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Convert truncate_mutex to read write semaphore.

We are currently taking the truncate_mutex for every read. This would have
performance impact on large CPU configuration. Convert the lock to read write
semaphore and take read lock when we are trying to read the file.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


# e2b46574 28-Jan-2008 Eric Sandeen <sandeen@redhat.com>

ext4: store maxbytes for bitmapped files and return EFBIG as appropriate

Calculate & store the max offset for bitmapped files, and
catch too-large seeks, truncates, and writes in ext4, shortening
or rejecting as appropriate.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>


# a2df2a63 17-Jul-2007 Amit Arora <aarora@in.ibm.com>

fallocate support in ext4

This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation. Current implementation only supports
preallocation for regular files (directories not supported as of date)
with extent maps. This patch does not support block-mapped files currently.
Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of
now.

Signed-off-by: Amit Arora <aarora@in.ibm.com>


# 5ffc4ef4 01-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

sendfile: remove .sendfile from filesystems that use generic_file_sendfile()

They can use generic_file_splice_read() instead. Since sys_sendfile() now
prefers that, there should be no change in behaviour.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 754661f1 12-Feb-2007 Arjan van de Ven <arjan@linux.intel.com>

[PATCH] mark struct inode_operations const 1

Many struct inode_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9d549890 08-Dec-2006 Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>

[PATCH] ext4: change uses of f_{dentry, vfsmnt} to use f_path

Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in the ext4
filesystem.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# dab291af 11-Oct-2006 Mingming Cao <cmm@us.ibm.com>

[PATCH] jbd2: enable building of jbd2 and have ext4 use it rather than jbd

Reworked from a patch by Mingming Cao and Randy Dunlap

Signed-off-By: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 617ba13b 11-Oct-2006 Mingming Cao <cmm@us.ibm.com>

[PATCH] ext4: rename ext4 symbols to avoid duplication of ext3 symbols

Mingming Cao originally did this work, and Shaggy reproduced it using some
scripts from her.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ac27a0ec 11-Oct-2006 Dave Kleikamp <shaggy@austin.ibm.com>

[PATCH] ext4: initial copy of files from ext3

Start of the ext4 patch series. See Documentation/filesystems/ext4.txt for
details.

This is a simple copy of the files in fs/ext3 to fs/ext4 and
/usr/incude/linux/ext3* to /usr/include/ex4*

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>