History log of /linux-master/fs/ext4/namei.c
Revision Date Author Comments
# 04aa5f4e 20-Feb-2024 Gabriel Krisman Bertazi <krisman@suse.de>

ext4: Configure dentry operations at dentry-creation time

This was already the case for case-insensitive before commit
bb9cd9106b22 ("fscrypt: Have filesystems handle their d_ops"), but it
was changed to set at lookup-time to facilitate the integration with
fscrypt. But it's a problem because dentries that don't get created
through ->lookup() won't have any visibility of the operations.

Since fscrypt now also supports configuring dentry operations at
creation-time, do it for any encrypted and/or casefold volume,
simplifying the implementation across these features.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-8-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>


# 556f38bf 12-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

ext4_add_entry(): ->d_name.len is never 0

That bogosity goes back to the initial merge of ext3. Once upon a time
ext2 used to have a similar check; that got taken out during the switch
to page cache (June 2001). ext3 got merged into mainline 5 months later,
still using buffer cache for directories; removal of the pointless check
in ext2 should've been done as a separate patch, but it hadn't been,
so that thing got missed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 40dbd071 17-Oct-2023 Al Viro <viro@zeniv.linux.org.uk>

ext4: don't access the source subdirectory content on same-directory rename

We can't really afford locking the source on same-directory rename;
currently vfs_rename() tries to do that, but it will have to be changed.
The logics in ext4 is lazy and goes looking for ".." in source even in
same-directory case. It's not hard to get rid of that, leaving that
behaviour only for cross-directory case; that VFS can get locks safely
(and will keep doing that after the coming changes).

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8fedebb5 24-Aug-2023 Wang Jianjian <wangjianjian0@foxmail.com>

ext4: fix incorrect offset

The last argument of ext4_check_dir_entry is dentry offset int the
file. Luckily this error only results in the wrong offset being
printed in the eventual error message.

Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/tencent_F992989953734FD5DE3F88ECB2191A856206@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b898ab23 04-Oct-2023 Jeff Layton <jlayton@kernel.org>

ext4: convert to new timestamp accessors

Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-33-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 7fda67e8 03-Aug-2023 Shida Zhang <zhangshida@kylinos.cn>

ext4: fix rec_len verify error

With the configuration PAGE_SIZE 64k and filesystem blocksize 64k,
a problem occurred when more than 13 million files were directly created
under a directory:

EXT4-fs error (device xx): ext4_dx_csum_set:492: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D.
EXT4-fs error (device xx): ext4_dx_csum_verify:463: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D.
EXT4-fs error (device xx): dx_probe:856: inode #xxxx: block 8188: comm xxxxx: Directory index failed checksum

When enough files are created, the fake_dirent->reclen will be 0xffff.
it doesn't equal to the blocksize 65536, i.e. 0x10000.

But it is not the same condition when blocksize equals to 4k.
when enough files are created, the fake_dirent->reclen will be 0x1000.
it equals to the blocksize 4k, i.e. 0x1000.

The problem seems to be related to the limitation of the 16-bit field
when the blocksize is set to 64k.
To address this, helpers like ext4_rec_len_{from,to}_disk has already
been introduced to complete the conversion between the encoded and the
plain form of rec_len.

So fix this one by using the helper, and all the other in this file too.

Cc: stable@kernel.org
Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes")
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20230803060938.1929759-1-zhangshida@kylinos.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b8142793 14-Aug-2023 Eric Biggers <ebiggers@google.com>

ext4: remove redundant checks of s_encoding

Now that ext4 does not allow inodes with the casefold flag to be
instantiated when unsupported, it's unnecessary to repeatedly check for
support later on during random filesystem operations.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230814182903.37267-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 15247734 28-Jun-2023 Zhihao Cheng <chengzhihao1@huawei.com>

ext4: fix unttached inode after power cut with orphan file feature enabled

Running generic/475(filesystem consistent tests after power cut) could
easily trigger unattached inode error while doing fsck:
Unattached zero-length inode 39405. Clear? no

Unattached inode 39405
Connect to /lost+found? no

Above inconsistence is caused by following process:
P1 P2
ext4_create
inode = ext4_new_inode_start_handle // itable records nlink=1
ext4_add_nondir
err = ext4_add_entry // ENOSPC
ext4_append
ext4_bread
ext4_getblk
ext4_map_blocks // returns ENOSPC
drop_nlink(inode) // won't be updated into disk inode
ext4_orphan_add(handle, inode)
ext4_orphan_file_add
ext4_journal_stop(handle)
jbd2_journal_commit_transaction // commit success
>> power cut <<
ext4_fill_super
ext4_load_and_init_journal // itable records nlink=1
ext4_orphan_cleanup
ext4_process_orphan
if (inode->i_nlink) // true, inode won't be deleted

Then, allocated inode will be reserved on disk and corresponds to no
dentries, so e2fsck reports 'unattached inode' problem.

The problem won't happen if orphan file feature is disabled, because
ext4_orphan_add() will update disk inode in orphan list mode. There
are several places not updating disk inode while putting inode into
orphan area, such as ext4_add_nondir(), ext4_symlink() and whiteout
in ext4_rename(). Fix it by updating inode into disk in all error
branches of these places.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217605
Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230628132011.650383-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# eb8ab444 16-Jun-2023 Jan Kara <jack@suse.cz>

ext4: make ext4_forced_shutdown() take struct super_block

Currently ext4_forced_shutdown() takes struct ext4_sb_info but most
callers need to get it from struct super_block anyway. So just pass in
struct super_block to save all callers from some boilerplate code. No
functional changes.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616165109.21695-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1bc33893 05-Jul-2023 Jeff Layton <jlayton@kernel.org>

ext4: convert to ctime accessor functions

In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-40-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 3658840c 31-May-2023 Jan Kara <jack@suse.cz>

ext4: Remove ext4 locking of moved directory

Remove locking of moved directory in ext4_rename2(). We will take care
of it in VFS instead. This effectively reverts commit 0813299c586b
("ext4: Fix possible corruption when moving a directory") and followup
fixes.

CC: Ted Tso <tytso@mit.edu>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-1-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 4b3cb1d1 06-May-2023 Theodore Ts'o <tytso@mit.edu>

ext4: improve error handling from ext4_dirhash()

The ext4_dirhash() will *almost* never fail, especially when the hash
tree feature was first introduced. However, with the addition of
support of encrypted, casefolded file names, that function can most
certainly fail today.

So make sure the callers of ext4_dirhash() properly check for
failures, and reflect the errors back up to their callers.

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230506142419.984260-1-tytso@mit.edu
Reported-by: syzbot+394aa8a792cb99dbc837@syzkaller.appspotmail.com
Reported-by: syzbot+344aaa8697ebd232bfc8@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=db56459ea4ac4a676ae4b4678f633e55da005a9b
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 70e42fea 17-Mar-2023 Theodore Ts'o <tytso@mit.edu>

ext4: fix possible double unlock when moving a directory

Fixes: 0813299c586b ("ext4: Fix possible corruption when moving a directory")
Link: https://lore.kernel.org/r/5efbe1b9-ad8b-4a4f-b422-24824d2b775c@kili.mountain
Reported-by: Dan Carpenter <error27@gmail.com>
Reported-by: syzbot+0c73d1d8b952c5f3d714@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3c92792d 01-Mar-2023 Jan Kara <jack@suse.cz>

ext4: Fix deadlock during directory rename

As lockdep properly warns, we should not be locking i_rwsem while having
transactions started as the proper lock ordering used by all directory
handling operations is i_rwsem -> transaction start. Fix the lock
ordering by moving the locking of the directory earlier in
ext4_rename().

Reported-by: syzbot+9d16c39efb5fade84574@syzkaller.appspotmail.com
Fixes: 0813299c586b ("ext4: Fix possible corruption when moving a directory")
Link: https://syzkaller.appspot.com/bug?extid=9d16c39efb5fade84574
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230301141004.15087-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c9f62c8b 09-Feb-2023 Eric Whitney <enwlinux@gmail.com>

ext4: fix RENAME_WHITEOUT handling for inline directories

A significant number of xfstests can cause ext4 to log one or more
warning messages when they are run on a test file system where the
inline_data feature has been enabled. An example:

"EXT4-fs warning (device vdc): ext4_dirblock_csum_set:425: inode
#16385: comm fsstress: No space for directory leaf checksum. Please
run e2fsck -D."

The xfstests include: ext4/057, 058, and 307; generic/013, 051, 068,
070, 076, 078, 083, 232, 269, 270, 390, 461, 475, 476, 482, 579, 585,
589, 626, 631, and 650.

In this situation, the warning message indicates a bug in the code that
performs the RENAME_WHITEOUT operation on a directory entry that has
been stored inline. It doesn't detect that the directory is stored
inline, and incorrectly attempts to compute a dirent block checksum on
the whiteout inode when creating it. This attempt fails as a result
of the integrity checking in get_dirent_tail (usually due to a failure
to match the EXT4_FT_DIR_CSUM magic cookie), and the warning message
is then emitted.

Fix this by simply collecting the inlined data state at the time the
search for the source directory entry is performed. Existing code
handles the rest, and this is sufficient to eliminate all spurious
warning messages produced by the tests above. Go one step further
and do the same in the code that resets the source directory entry in
the event of failure. The inlined state should be present in the
"old" struct, but given the possibility of a race there's no harm
in taking a conservative approach and getting that information again
since the directory entry is being reread anyway.

Fixes: b7ff91fd030d ("ext4: find old entry again if failed to rename whiteout")
Cc: stable@kernel.org
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230210173244.679890-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 0813299c 25-Jan-2023 Jan Kara <jack@suse.cz>

ext4: Fix possible corruption when moving a directory

When we are renaming a directory to a different directory, we need to
update '..' entry in the moved directory. However nothing prevents moved
directory from being modified and even converted from the inline format
to the normal format. When such race happens the rename code gets
confused and we crash. Fix the problem by locking the moved directory.

CC: stable@vger.kernel.org
Fixes: 32f7f22c0b52 ("ext4: let ext4_rename handle inline dir")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230126112221.11866-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f2d40141 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port inode_init_owner() to mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 011e2b71 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->tmpfile() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# e18275ae 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->rename() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 5ebb29be 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->mknod() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c54bd91e 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->mkdir() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 7a77db95 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->symlink() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 6c960e68 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->create() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# fae381a3 06-Nov-2022 Ye Bin <yebin10@huawei.com>

ext4: init quota for 'old.inode' in 'ext4_rename'

Syzbot found the following issue:
ext4_parse_param: s_want_extra_isize=128
ext4_inode_info_init: s_want_extra_isize=32
ext4_rename: old.inode=ffff88823869a2c8 old.dir=ffff888238699828 new.inode=ffff88823869d7e8 new.dir=ffff888238699828
__ext4_mark_inode_dirty: inode=ffff888238699828 ea_isize=32 want_ea_size=128
__ext4_mark_inode_dirty: inode=ffff88823869a2c8 ea_isize=32 want_ea_size=128
ext4_xattr_block_set: inode=ffff88823869a2c8
------------[ cut here ]------------
WARNING: CPU: 13 PID: 2234 at fs/ext4/xattr.c:2070 ext4_xattr_block_set.cold+0x22/0x980
Modules linked in:
RIP: 0010:ext4_xattr_block_set.cold+0x22/0x980
RSP: 0018:ffff888227d3f3b0 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff88823007a000 RCX: 0000000000000000
RDX: 0000000000000a03 RSI: 0000000000000040 RDI: ffff888230078178
RBP: 0000000000000000 R08: 000000000000002c R09: ffffed1075c7df8e
R10: ffff8883ae3efc6b R11: ffffed1075c7df8d R12: 0000000000000000
R13: ffff88823869a2c8 R14: ffff8881012e0460 R15: dffffc0000000000
FS: 00007f350ac1f740(0000) GS:ffff8883ae200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f350a6ed6a0 CR3: 0000000237456000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? ext4_xattr_set_entry+0x3b7/0x2320
? ext4_xattr_block_set+0x0/0x2020
? ext4_xattr_set_entry+0x0/0x2320
? ext4_xattr_check_entries+0x77/0x310
? ext4_xattr_ibody_set+0x23b/0x340
ext4_xattr_move_to_block+0x594/0x720
ext4_expand_extra_isize_ea+0x59a/0x10f0
__ext4_expand_extra_isize+0x278/0x3f0
__ext4_mark_inode_dirty.cold+0x347/0x410
ext4_rename+0xed3/0x174f
vfs_rename+0x13a7/0x2510
do_renameat2+0x55d/0x920
__x64_sys_rename+0x7d/0xb0
do_syscall_64+0x3b/0xa0
entry_SYSCALL_64_after_hwframe+0x72/0xdc

As 'ext4_rename' will modify 'old.inode' ctime and mark inode dirty,
which may trigger expand 'extra_isize' and allocate block. If inode
didn't init quota will lead to warning. To solve above issue, init
'old.inode' firstly in 'ext4_rename'.

Reported-by: syzbot+98346927678ac3059c77@syzkaller.appspotmail.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221107015335.2524319-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 4c0d5778 06-Nov-2022 Eric Biggers <ebiggers@google.com>

ext4: don't set up encryption key during jbd2 transaction

Commit a80f7fcf1867 ("ext4: fixup ext4_fc_track_* functions' signature")
extended the scope of the transaction in ext4_unlink() too far, making
it include the call to ext4_find_entry(). However, ext4_find_entry()
can deadlock when called from within a transaction because it may need
to set up the directory's encryption key.

Fix this by restoring the transaction to its original scope.

Reported-by: syzbot+1a748d0007eeac3ab079@syzkaller.appspotmail.com
Fixes: a80f7fcf1867 ("ext4: fixup ext4_fc_track_* functions' signature")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# cac2f8b8 22-Sep-2022 Christian Brauner <brauner@kernel.org>

fs: rename current get acl method

The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

The current inode operation for getting posix acls takes an inode
argument but various filesystems (e.g., 9p, cifs, overlayfs) need access
to the dentry. In contrast to the ->set_acl() inode operation we cannot
simply extend ->get_acl() to take a dentry argument. The ->get_acl()
inode operation is called from:

acl_permission_check()
-> check_acl()
-> get_acl()

which is part of generic_permission() which in turn is part of
inode_permission(). Both generic_permission() and inode_permission() are
called in the ->permission() handler of various filesystems (e.g.,
overlayfs). So simply passing a dentry argument to ->get_acl() would
amount to also having to pass a dentry argument to ->permission(). We
should avoid this unnecessary change.

So instead of extending the existing inode operation rename it from
->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that
passes a dentry argument and which filesystems that need access to the
dentry can implement instead of ->get_inode_acl(). Filesystems like cifs
which allow setting and getting posix acls but not using them for
permission checking during lookup can simply not implement
->get_inode_acl().

This is intended to be a non-functional change.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 17a0bc9b 12-Oct-2022 Luís Henriques <lhenriques@suse.de>

ext4: fix BUG_ON() when directory entry has invalid rec_len

The rec_len field in the directory entry has to be a multiple of 4. A
corrupted filesystem image can be used to hit a BUG() in
ext4_rec_len_to_disk(), called from make_indexed_dir().

------------[ cut here ]------------
kernel BUG at fs/ext4/ext4.h:2413!
...
RIP: 0010:make_indexed_dir+0x53f/0x5f0
...
Call Trace:
<TASK>
? add_dirent_to_buf+0x1b2/0x200
ext4_add_entry+0x36e/0x480
ext4_add_nondir+0x2b/0xc0
ext4_create+0x163/0x200
path_openat+0x635/0xe90
do_filp_open+0xb4/0x160
? __create_object.isra.0+0x1de/0x3b0
? _raw_spin_unlock+0x12/0x30
do_sys_openat2+0x91/0x150
__x64_sys_open+0x6c/0xa0
do_syscall_64+0x3c/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0

The fix simply adds a call to ext4_check_dir_entry() to validate the
directory entry, returning -EFSCORRUPTED if the entry is invalid.

CC: stable@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216540
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Link: https://lore.kernel.org/r/20221012131330.32456-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 863f144f 23-Sep-2022 Miklos Szeredi <mszeredi@redhat.com>

vfs: open inside ->tmpfile()

This is in preparation for adding tmpfile support to fuse, which requires
that the tmpfile creation and opening are done as a single operation.

Replace the 'struct dentry *' argument of i_op->tmpfile with
'struct file *'.

Call finish_open_simple() as the last thing in ->tmpfile() instances (may
be omitted in the error case).

Change d_tmpfile() argument to 'struct file *' as well to make callers more
readable.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 7177dd00 10-Sep-2022 Zhihao Cheng <chengzhihao1@huawei.com>

ext4: fix dir corruption when ext4_dx_add_entry() fails

Following process may lead to fs corruption:
1. ext4_create(dir/foo)
ext4_add_nondir
ext4_add_entry
ext4_dx_add_entry
a. add_dirent_to_buf
ext4_mark_inode_dirty
ext4_handle_dirty_metadata // dir inode bh is recorded into journal
b. ext4_append // dx_get_count(entries) == dx_get_limit(entries)
ext4_bread(EXT4_GET_BLOCKS_CREATE)
ext4_getblk
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
dquot_alloc_block
dquot_alloc_space_nodirty
inode_add_bytes // update dir's i_blocks
ext4_ext_insert_extent
ext4_ext_dirty // record extent bh into journal
ext4_handle_dirty_metadata(bh)
// record new block into journal
inode->i_size += inode->i_sb->s_blocksize // new size(in mem)
c. ext4_handle_dirty_dx_node(bh2)
// record dir's new block(dx_node) into journal
d. ext4_handle_dirty_dx_node((frame - 1)->bh)
e. ext4_handle_dirty_dx_node(frame->bh)
f. do_split // ret err!
g. add_dirent_to_buf
ext4_mark_inode_dirty(dir) // update raw_inode on disk(skipped)
2. fsck -a /dev/sdb
drop last block(dx_node) which beyonds dir's i_size.
/dev/sdb: recovering journal
/dev/sdb contains a file system with errors, check forced.
/dev/sdb: Inode 12, end of extent exceeds allowed value
(logical block 128, physical block 3938, len 1)
3. fsck -fn /dev/sdb
dx_node->entry[i].blk > dir->i_size
Pass 2: Checking directory structure
Problem in HTREE directory inode 12 (/dir): bad block number 128.
Clear HTree index? no
Problem in HTREE directory inode 12: block #3 has invalid depth (2)
Problem in HTREE directory inode 12: block #3 has bad max hash
Problem in HTREE directory inode 12: block #3 not referenced

Fix it by marking inode dirty directly inside ext4_append().
Fetch a reproducer in [Link].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216466
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220911045204.516460-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 61a1d87a 22-Aug-2022 Jan Kara <jack@suse.cz>

ext4: fix check for block being out of directory size

The check in __ext4_read_dirblock() for block being outside of directory
size was wrong because it compared block number against directory size
in bytes. Fix it.

Fixes: 65f8ea4cd57d ("ext4: check if directory block is within i_size")
CVE: CVE-2022-1184
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220822114832.1482-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b8a04fe7 04-Jul-2022 Lukas Czerner <lczerner@redhat.com>

ext4: make sure ext4_append() always allocates new block

ext4_append() must always allocate a new block, otherwise we run the
risk of overwriting existing directory block corrupting the directory
tree in the process resulting in all manner of problems later on.

Add a sanity check to see if the logical block is already allocated and
error out if it is.

Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220704142721.157985-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 65f8ea4c 04-Jul-2022 Lukas Czerner <lczerner@redhat.com>

ext4: check if directory block is within i_size

Currently ext4 directory handling code implicitly assumes that the
directory blocks are always within the i_size. In fact ext4_append()
will attempt to allocate next directory block based solely on i_size and
the i_size is then appropriately increased after a successful
allocation.

However, for this to work it requires i_size to be correct. If, for any
reason, the directory inode i_size is corrupted in a way that the
directory tree refers to a valid directory block past i_size, we could
end up corrupting parts of the directory tree structure by overwriting
already used directory blocks when modifying the directory.

Fix it by catching the corruption early in __ext4_read_dirblock().

Addresses Red-Hat-Bugzilla: #2070205
CVE: CVE-2022-1184
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220704142721.157985-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b24e77ef 22-Jun-2022 Ye Bin <yebin10@huawei.com>

ext4: avoid remove directory when directory is corrupted

Now if check directoy entry is corrupted, ext4_empty_dir may return true
then directory will be removed when file system mounted with "errors=continue".
In order not to make things worse just return false when directory is corrupted.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220622090223.682234-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bc75a6eb 30-May-2022 Ding Xiang <dingxiang@cmss.chinamobile.com>

ext4: make variable "count" signed

Since dx_make_map() may return -EFSCORRUPTED now, so change "count" to
be a signed integer so we can correctly check for an error code returned
by dx_make_map().

Fixes: 46c116b920eb ("ext4: verify dir block before splitting it")
Cc: stable@kernel.org
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Link: https://lore.kernel.org/r/20220530100047.537598-1-dingxiang@cmss.chinamobile.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3ba733f8 18-May-2022 Jan Kara <jack@suse.cz>

ext4: avoid cycles in directory h-tree

A maliciously corrupted filesystem can contain cycles in the h-tree
stored inside a directory. That can easily lead to the kernel corrupting
tree nodes that were already verified under its hands while doing a node
split and consequently accessing unallocated memory. Fix the problem by
verifying traversed block numbers are unique.

Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518093332.13986-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 46c116b9 18-May-2022 Jan Kara <jack@suse.cz>

ext4: verify dir block before splitting it

Before splitting a directory block verify its directory entries are sane
so that the splitting code does not access memory it should not.

Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518093332.13986-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6493792d 24-Apr-2022 Zhang Yi <yi.zhang@huawei.com>

ext4: convert symlink external data block mapping to bdev

Symlink's external data block is one kind of metadata block, and now
that almost all ext4 metadata block's page cache (e.g. directory blocks,
quota blocks...) belongs to bdev backing inode except the symlink. It
is essentially worked in data=journal mode like other regular file's
data block because probably in order to make it simple for generic VFS
code handling symlinks or some other historical reasons, but the logic
of creating external data block in ext4_symlink() is complicated. and it
also make things confused if user do not want to let the filesystem
worked in data=journal mode. This patch convert the final exceptional
case and make things clean, move the mapping of the symlink's external
data block to bdev like any other metadata block does.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20220424140936.1898920-3-yi.zhang@huawei.com


# 0be698ec 13-Apr-2022 Ye Bin <yebin10@huawei.com>

ext4: fix use-after-free in ext4_rename_dir_prepare

We got issue as follows:
EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue
ext4_get_first_dir_block: bh->b_data=0xffff88810bee6000 len=34478
ext4_get_first_dir_block: *parent_de=0xffff88810beee6ae bh->b_data=0xffff88810bee6000
ext4_rename_dir_prepare: [1] parent_de=0xffff88810beee6ae
==================================================================
BUG: KASAN: use-after-free in ext4_rename_dir_prepare+0x152/0x220
Read of size 4 at addr ffff88810beee6ae by task rep/1895

CPU: 13 PID: 1895 Comm: rep Not tainted 5.10.0+ #241
Call Trace:
dump_stack+0xbe/0xf9
print_address_description.constprop.0+0x1e/0x220
kasan_report.cold+0x37/0x7f
ext4_rename_dir_prepare+0x152/0x220
ext4_rename+0xf44/0x1ad0
ext4_rename2+0x11c/0x170
vfs_rename+0xa84/0x1440
do_renameat2+0x683/0x8f0
__x64_sys_renameat+0x53/0x60
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f45a6fc41c9
RSP: 002b:00007ffc5a470218 EFLAGS: 00000246 ORIG_RAX: 0000000000000108
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f45a6fc41c9
RDX: 0000000000000005 RSI: 0000000020000180 RDI: 0000000000000005
RBP: 00007ffc5a470240 R08: 00007ffc5a470160 R09: 0000000020000080
R10: 00000000200001c0 R11: 0000000000000246 R12: 0000000000400bb0
R13: 00007ffc5a470320 R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:00000000440015ce refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x10beee
flags: 0x200000000000000()
raw: 0200000000000000 ffffea00043ff4c8 ffffea0004325608 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88810beee580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810beee600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88810beee680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88810beee700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810beee780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
ext4_rename_dir_prepare: [2] parent_de->inode=3537895424
ext4_rename_dir_prepare: [3] dir=0xffff888124170140
ext4_rename_dir_prepare: [4] ino=2
ext4_rename_dir_prepare: ent->dir->i_ino=2 parent=-757071872

Reason is first directory entry which 'rec_len' is 34478, then will get illegal
parent entry. Now, we do not check directory entry after read directory block
in 'ext4_get_first_dir_block'.
To solve this issue, check directory entry in 'ext4_get_first_dir_block'.

[ Trigger an ext4_error() instead of just warning if the directory is
missing a '.' or '..' entry. Also make sure we return an error code
if the file system is corrupted. -TYT ]

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220414025223.4113128-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 784a0995 10-Apr-2022 Lv Ruyi <lv.ruyi@zte.com.cn>

ext4: remove unnecessary conditionals

iput() has already handled null and non-null parameter, so it is no
need to use if().

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
Link: https://lore.kernel.org/r/20220411032337.2517465-1-lv.ruyi@zte.com.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c30365b9 01-Apr-2022 Yu Zhe <yuzhe@nfschina.com>

ext4: remove unnecessary type castings

remove unnecessary void* type castings.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Link: https://lore.kernel.org/r/20220401081321.73735-1-yuzhe@nfschina.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a125d2ae 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

ext4: Use page_symlink() instead of __page_symlink()

By using the memalloc_nofs_save() functionality, we can call
page_symlink(), safe in the knowledge that it won't recurse into the
filesystem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# c186f088 24-Mar-2022 Ye Bin <yebin10@huawei.com>

ext4: fix use-after-free in ext4_search_dir

We got issue as follows:
EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue
==================================================================
BUG: KASAN: use-after-free in ext4_search_dir fs/ext4/namei.c:1394 [inline]
BUG: KASAN: use-after-free in search_dirblock fs/ext4/namei.c:1199 [inline]
BUG: KASAN: use-after-free in __ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553
Read of size 1 at addr ffff8881317c3005 by task syz-executor117/2331

CPU: 1 PID: 2331 Comm: syz-executor117 Not tainted 5.10.0+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:83 [inline]
dump_stack+0x144/0x187 lib/dump_stack.c:124
print_address_description+0x7d/0x630 mm/kasan/report.c:387
__kasan_report+0x132/0x190 mm/kasan/report.c:547
kasan_report+0x47/0x60 mm/kasan/report.c:564
ext4_search_dir fs/ext4/namei.c:1394 [inline]
search_dirblock fs/ext4/namei.c:1199 [inline]
__ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553
ext4_lookup_entry fs/ext4/namei.c:1622 [inline]
ext4_lookup+0xb8/0x3a0 fs/ext4/namei.c:1690
__lookup_hash+0xc5/0x190 fs/namei.c:1451
do_rmdir+0x19e/0x310 fs/namei.c:3760
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x445e59
Code: 4d c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fff2277fac8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
RAX: ffffffffffffffda RBX: 0000000000400280 RCX: 0000000000445e59
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200000c0
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000002
R10: 00007fff2277f990 R11: 0000000000000246 R12: 0000000000000000
R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:0000000048cd3304 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1317c3
flags: 0x200000000000000()
raw: 0200000000000000 ffffea0004526588 ffffea0004528088 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff8881317c2f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8881317c2f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8881317c3000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff8881317c3080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff8881317c3100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================

ext4_search_dir:
...
de = (struct ext4_dir_entry_2 *)search_buf;
dlimit = search_buf + buf_size;
while ((char *) de < dlimit) {
...
if ((char *) de + de->name_len <= dlimit &&
ext4_match(dir, fname, de)) {
...
}
...
de_len = ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize);
if (de_len <= 0)
return -1;
offset += de_len;
de = (struct ext4_dir_entry_2 *) ((char *) de + de_len);
}

Assume:
de=0xffff8881317c2fff
dlimit=0x0xffff8881317c3000

If read 'de->name_len' which address is 0xffff8881317c3005, obviously is
out of range, then will trigger use-after-free.
To solve this issue, 'dlimit' must reserve 8 bytes, as we will read
'de->name_len' to judge if '(char *) de + de->name_len' out of range.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220324064816.1209985-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 78be0471 11-Mar-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: return early for non-eligible fast_commit track events

Currently ext4_fc_track_template() checks, whether the trace event
path belongs to replay or does sb has ineligible set, if yes it simply
returns. This patch pulls those checks before calling
ext4_fc_track_template() in the callers of ext4_fc_track_template().

[ Add checks to ext4_rename() which calls the __ext4_fc_track_*()
functions directly. -- TYT ]

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/3cd025d9c490218a92e6d8fb30b6123e693373e3.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7aab5c84 27-Feb-2022 Ye Bin <yebin10@huawei.com>

ext4: fix fs corruption when tring to remove a non-empty directory with IO error

We inject IO error when rmdir non empty direcory, then got issue as follows:
step1: mkfs.ext4 -F /dev/sda
step2: mount /dev/sda test
step3: cd test
step4: mkdir -p 1/2
step5: rmdir 1
[ 110.920551] ext4_empty_dir: inject fault
[ 110.921926] EXT4-fs warning (device sda): ext4_rmdir:3113: inode #12:
comm rmdir: empty directory '1' has too many links (3)
step6: cd ..
step7: umount test
step8: fsck.ext4 -f /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '..' in .../??? (13) has deleted/unused inode 12. Clear<y>? yes
Pass 3: Checking directory connectivity
Unconnected directory inode 13 (...)
Connect to /lost+found<y>? yes
Pass 4: Checking reference counts
Inode 13 ref count is 3, should be 2. Fix<y>? yes
Pass 5: Checking group summary information

/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 12/131072 files (0.0% non-contiguous), 26157/524288 blocks

ext4_rmdir
if (!ext4_empty_dir(inode))
goto end_rmdir;
ext4_empty_dir
bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE);
if (IS_ERR(bh))
return true;
Now if read directory block failed, 'ext4_empty_dir' will return true, assume
directory is empty. Obviously, it will lead to above issue.
To solve this issue, if read directory block failed 'ext4_empty_dir' just
return false. To avoid making things worse when file system is already
corrupted, 'ext4_empty_dir' also return false.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20220228024815.3952506-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e85c81ba 17-Jan-2022 Xin Yin <yinxin.x@bytedance.com>

ext4: fast commit may not fallback for ineligible commit

For the follow scenario:
1. jbd start commit transaction n
2. task A get new handle for transaction n+1
3. task A do some ineligible actions and mark FC_INELIGIBLE
4. jbd complete transaction n and clean FC_INELIGIBLE
5. task A call fsync

In this case fast commit will not fallback to full commit and
transaction n+1 also not handled by jbd.

Make ext4_fc_mark_ineligible() also record transaction tid for
latest ineligible case, when call ext4_fc_cleanup() check
current transaction tid, if small than latest ineligible tid
do not clear the EXT4_MF_FC_INELIGIBLE.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Suggested-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Link: https://lore.kernel.org/r/20220117093655.35160-2-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 5298d4bf 17-Jan-2022 Christoph Hellwig <hch@lst.de>

unicode: clean up the Kconfig symbol confusion

Turn the CONFIG_UNICODE symbol into a tristate that generates some always
built in code and remove the confusing CONFIG_UNICODE_UTF8_DATA symbol.

Note that a lot of the IS_ENABLED() checks could be turned from cpp
statements into normal ifs, but this change is intended to be fairly
mechanic, so that should be cleaned up later.

Fixes: 2b3d04787012 ("unicode: Add utf8-data module")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>


# d4ffeeb7 23-Aug-2021 Jing Yangyang <jing.yangyang@zte.com.cn>

ext4: fix boolreturn.cocci warnings in fs/ext4/name.c

Return statements in functions returning bool should use true/false
instead of 1/0.

./fs/ext4/namei.c:1441:12-13:WARNING:return of 0/1 in function
'ext4_match' with return type bool

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Jing Yangyang <jing.yangyang@zte.com.cn>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210824055543.58718-1-deng.changcheng@zte.com.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 25c6d98f 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Move orphan inode handling into a separate file

Move functions for handling orphan inodes into a new file
fs/ext4/orphan.c to have them in one place and somewhat reduce size of
other files. No code changes.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 188c299e 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Support for checksumming from journal triggers

JBD2 layer support triggers which are called when journaling layer moves
buffer to a certain state. We can use the frozen trigger, which gets
called when buffer data is frozen and about to be written out to the
journal, to compute block checksums for some buffer types (similarly as
does ocfs2). This avoids unnecessary repeated recomputation of the
checksum (at the cost of larger window where memory corruption won't be
caught by checksumming) and is even necessary when there are
unsynchronized updaters of the checksummed data.

So add superblock and journal trigger type arguments to
ext4_journal_get_write_access() and ext4_journal_get_create_access() so
that frozen triggers can be set accordingly. Also add inode argument to
ext4_walk_page_buffers() and all the callbacks used with that function
for the same purpose. This patch is mostly only a change of prototype of
the above mentioned functions and a few small helpers. Real checksumming
will come later.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 877ba3f7 04-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: fix potential htree corruption when growing large_dir directories

Commit b5776e7524af ("ext4: fix potential htree index checksum
corruption) removed a required restart when multiple levels of index
nodes need to be split. Fix this to avoid directory htree corruptions
when using the large_dir feature.

Cc: stable@kernel.org # v5.11
Cc: Благодаренко Артём <artem.blagodarenko@gmail.com>
Fixes: b5776e7524af ("ext4: fix potential htree index checksum corruption)
Reported-by: Denis <denis@voxelsoft.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b2d2e757 20-May-2021 Tian Tao <tiantao6@hisilicon.com>

ext4: remove set but rewrite variables

In the ext4_dx_add_entry function, the at variable is assigned but will
reset just after “again:” label. So delete the unnecessary assignment.
this will not chang the logic.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Link: https://lore.kernel.org/r/1621493752-36890-1-git-send-email-tiantao6@hisilicon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 63e7f128 21-May-2021 Daniel Rosenberg <drosen@google.com>

ext4: fix no-key deletion for encrypt+casefold

commit 471fbbea7ff7 ("ext4: handle casefolding with encryption") is
missing a few checks for the encryption key which are needed to
support deleting enrypted casefolded files when the key is not
present.

This bug made it impossible to delete encrypted+casefolded directories
without the encryption key, due to errors like:

W : EXT4-fs warning (device vdc): __ext4fs_dirhash:270: inode #49202: comm Binder:378_4: Siphash requires key

Repro steps in kvm-xfstests test appliance:
mkfs.ext4 -F -E encoding=utf8 -O encrypt /dev/vdc
mount /vdc
mkdir /vdc/dir
chattr +F /vdc/dir
keyid=$(head -c 64 /dev/zero | xfs_io -c add_enckey /vdc | awk '{print $NF}')
xfs_io -c "set_encpolicy $keyid" /vdc/dir
for i in `seq 1 100`; do
mkdir /vdc/dir/$i
done
xfs_io -c "rm_enckey $keyid" /vdc
rm -rf /vdc/dir # fails with the bug

Fixes: 471fbbea7ff7 ("ext4: handle casefolding with encryption")
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210522004132.2142563-1-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 80e5d1ff5 15-Apr-2021 Al Viro <viro@zeniv.linux.org.uk>

useful constants: struct qstr for ".."

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6c091273 22-Apr-2021 Leah Rumancik <leah.rumancik@gmail.com>

ext4: wipe ext4_dir_entry2 upon file deletion

Upon file deletion, zero out all fields in ext4_dir_entry2 besides rec_len.
In case sensitive data is stored in filenames, this ensures no potentially
sensitive data is left in the directory entry upon deletion. Also, wipe
these fields upon moving a directory entry during the conversion to an
htree and when splitting htree nodes.

The data wiped may still exist in the journal, but there are future
commits planned to address this.

Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Link: https://lore.kernel.org/r/20210422180834.2242353-1-leah.rumancik@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 666245d9 08-Apr-2021 Jack Qiu <jack.qiu@huawei.com>

ext4: fix trailing whitespace

Made suggested modifications from checkpatch in reference to ERROR:
trailing whitespace

Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20210409042035.15516-1-jack.qiu@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3088e5a5 27-Mar-2021 Bhaskar Chowdhury <unixbhaskar@gmail.com>

ext4: fix various seppling typos

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Link: https://lore.kernel.org/r/cover.1616840203.git.unixbhaskar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1ae98e29 19-Mar-2021 Daniel Rosenberg <drosen@google.com>

ext4: optimize match for casefolded encrypted dirs

Matching names with casefolded encrypting directories requires
decrypting entries to confirm case since we are case preserving. We can
avoid needing to decrypt if our hash values don't match.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210319073414.1381041-3-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 471fbbea 19-Mar-2021 Daniel Rosenberg <drosen@google.com>

ext4: handle casefolding with encryption

This adds support for encryption with casefolding.

Since the name on disk is case preserving, and also encrypted, we can no
longer just recompute the hash on the fly. Additionally, to avoid
leaking extra information from the hash of the unencrypted name, we use
siphash via an fscrypt v2 policy.

The hash is stored at the end of the directory entry for all entries
inside of an encrypted and casefolded directory apart from those that
deal with '.' and '..'. This way, the change is backwards compatible
with existing ext4 filesystems.

[ Changed to advertise this feature via the file:
/sys/fs/ext4/features/encrypted_casefold -- TYT ]

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210319073414.1381041-2-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4db5c2e6 07-Apr-2021 Miklos Szeredi <mszeredi@redhat.com>

ext4: convert to fileattr

Use the fileattr API to let the VFS handle locking, permission checking and
conversion.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>


# 8210bb29 16-Mar-2021 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: fix rename whiteout with fast commit

This patch adds rename whiteout support in fast commits. Note that the
whiteout object that gets created is actually char device. Which
imples, the function ext4_inode_journal_mode(struct inode *inode)
would return "JOURNAL_DATA" for this inode. This has a consequence in
fast commit code that it will make creation of the whiteout object a
fast-commit ineligible behavior and thus will fall back to full
commits. With this patch, this can be observed by running fast commits
with rename whiteout and seeing the stats generated by ext4_fc_stats
tracepoint as follows:

ext4_fc_stats: dev 254:32 fc ineligible reasons:
XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0,
RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16;
num_commits:6, ineligible: 6, numblks: 3

So in short, this patch guarantees that in case of rename whiteout, we
fall back to full commits.

Amir mentioned that instead of creating a new whiteout object for
every rename, we can create a static whiteout object with irrelevant
nlink. That will make fast commits to not fall back to full
commit. But until this happens, this patch will ensure correctness by
falling back to full commits.

Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Cc: stable@kernel.org
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5dccdc5a 03-Mar-2021 zhangyi (F) <yi.zhang@huawei.com>

ext4: do not iput inode under running transaction in ext4_rename()

In ext4_rename(), when RENAME_WHITEOUT failed to add new entry into
directory, it ends up dropping new created whiteout inode under the
running transaction. After commit <9b88f9fb0d2> ("ext4: Do not iput inode
under running transaction"), we follow the assumptions that evict() does
not get called from a transaction context but in ext4_rename() it breaks
this suggestion. Although it's not a real problem, better to obey it, so
this patch add inode to orphan list and stop transaction before final
iput().

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210303131703.330415-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b7ff91fd 03-Mar-2021 zhangyi (F) <yi.zhang@huawei.com>

ext4: find old entry again if failed to rename whiteout

If we failed to add new entry on rename whiteout, we cannot reset the
old->de entry directly, because the old->de could have moved from under
us during make indexed dir. So find the old entry again before reset is
needed, otherwise it may corrupt the filesystem as below.

/dev/sda: Entry '00000001' in ??? (12) has deleted/unused inode 15. CLEARED.
/dev/sda: Unattached inode 75
/dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

Fixes: 6b4b8e6b4ad ("ext4: fix bug for rename with RENAME_WHITEOUT")
Cc: stable@vger.kernel.org
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210303131703.330415-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b5776e75 03-Feb-2021 Theodore Ts'o <tytso@mit.edu>

ext4: fix potential htree index checksum corruption

In the case where we need to do an interior node split, and
immediately afterwards, we are unable to allocate a new directory leaf
block due to ENOSPC, the directory index checksum's will not be filled
in correctly (and indeed, will not be correctly journalled).

This looks like a bug that was introduced when we added largedir
support. The original code doesn't make any sense (and should have
been caught in code review), but it was hidden because most of the
time, the index node checksum will be set by do_split(). But if
do_split bails out due to ENOSPC, then ext4_handle_dirty_dx_node()
won't get called, and so the directory index checksum field will not
get set, leading to:

EXT4-fs error (device sdb): dx_probe:858: inode #6635543: block 4022: comm nfsd: Directory index failed checksum

Google-Bug-Id: 176345532
Fixes: e08ac99fa2a2 ("ext4: add largedir feature")
Cc: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c6c818e5 02-Feb-2021 Vinicius Tinti <viniciustinti@gmail.com>

ext4: factor out htree rep invariant check

This patch moves some debugging code which is used to validate the
hash tree node when doing a binary search of an htree node into a
separate function, which is disabled by default (since it is only used
by developers when they are modifying the htree code paths).

In addition to cleaning up the code to make it more maintainable, it
silences a Clang compiler warning when -Wunreachable-code-aggressive
is enabled. (There is no plan to enable this warning by default, since
there it has far too many false positives; nevertheless, this commit
reduces one of the many false positives by one.)

Signed-off-by: Vinicius Tinti <viniciustinti@gmail.com>
Link: https://lore.kernel.org/r/20210202162837.129631-1-viniciustinti@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 14f3db55 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

ext4: support idmapped mounts

Enable idmapped mounts for ext4. All dedicated helpers we need for this
exist. So this basically just means we're passing down the
user_namespace argument from the VFS methods to the relevant helpers.

Let's create simple example where we idmap an ext4 filesystem:

root@f2-vm:~# truncate -s 5G ext4.img

root@f2-vm:~# mkfs.ext4 ./ext4.img
mke2fs 1.45.5 (07-Jan-2020)
Discarding device blocks: done
Creating filesystem with 1310720 4k blocks and 327680 inodes
Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

root@f2-vm:~# losetup -f --show ./ext4.img
/dev/loop0

root@f2-vm:~# mount /dev/loop0 /mnt

root@f2-vm:~# ls -al /mnt/
total 24
drwxr-xr-x 3 root root 4096 Oct 28 13:34 .
drwxr-xr-x 30 root root 4096 Oct 28 13:22 ..
drwx------ 2 root root 16384 Oct 28 13:34 lost+found

# Let's create an idmapped mount at /idmapped1 where we map uid and gid
# 0 to uid and gid 1000
root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/

root@f2-vm:/# ls -al /idmapped1/
total 24
drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 28 13:34 .
drwxr-xr-x 30 root root 4096 Oct 28 13:22 ..
drwx------ 2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found

# Let's create an idmapped mount at /idmapped2 where we map uid and gid
# 0 to uid and gid 2000
root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/

root@f2-vm:/# ls -al /idmapped2/
total 24
drwxr-xr-x 3 2000 2000 4096 Oct 28 13:34 .
drwxr-xr-x 31 root root 4096 Oct 28 13:39 ..
drwx------ 2 2000 2000 16384 Oct 28 13:34 lost+found

Let's create another example where we idmap the rootfs filesystem
without a mapping for uid 0 and gid 0:

# Create an idmapped mount of for a full POSIX range of rootfs under
# /mnt but without a mapping for uid 0 to reduce attack surface

root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/

# Since we don't have a mapping for uid and gid 0 all files owned by
# uid and gid 0 should show up as uid and gid 65534:
root@f2-vm:/# ls -al /mnt/
total 664
drwxr-xr-x 31 nobody nogroup 4096 Oct 28 13:39 .
drwxr-xr-x 31 root root 4096 Oct 28 13:39 ..
lrwxrwxrwx 1 nobody nogroup 7 Aug 25 07:44 bin -> usr/bin
drwxr-xr-x 4 nobody nogroup 4096 Oct 28 13:17 boot
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:48 dev
drwxr-xr-x 81 nobody nogroup 4096 Oct 28 04:00 etc
drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 home
lrwxrwxrwx 1 nobody nogroup 7 Aug 25 07:44 lib -> usr/lib
lrwxrwxrwx 1 nobody nogroup 9 Aug 25 07:44 lib32 -> usr/lib32
lrwxrwxrwx 1 nobody nogroup 9 Aug 25 07:44 lib64 -> usr/lib64
lrwxrwxrwx 1 nobody nogroup 10 Aug 25 07:44 libx32 -> usr/libx32
drwx------ 2 nobody nogroup 16384 Aug 25 07:47 lost+found
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 media
drwxr-xr-x 31 nobody nogroup 4096 Oct 28 13:39 mnt
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 opt
drwxr-xr-x 2 nobody nogroup 4096 Apr 15 2020 proc
drwx--x--x 6 nobody nogroup 4096 Oct 28 13:34 root
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:46 run
lrwxrwxrwx 1 nobody nogroup 8 Aug 25 07:44 sbin -> usr/sbin
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 srv
drwxr-xr-x 2 nobody nogroup 4096 Apr 15 2020 sys
drwxrwxrwt 10 nobody nogroup 4096 Oct 28 13:19 tmp
drwxr-xr-x 14 nobody nogroup 4096 Oct 20 13:00 usr
drwxr-xr-x 12 nobody nogroup 4096 Aug 25 07:45 var

# Since we do have a mapping for uid and gid 1000 all files owned by
# uid and gid 1000 should simply show up as uid and gid 1000:
root@f2-vm:/# ls -al /mnt/home/ubuntu/
total 40
drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 28 00:43 .
drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 2936 Oct 28 12:26 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 6b4b8e6b 04-Jan-2021 yangerkun <yangerkun@huawei.com>

ext4: fix bug for rename with RENAME_WHITEOUT

We got a "deleted inode referenced" warning cross our fsstress test. The
bug can be reproduced easily with following steps:

cd /dev/shm
mkdir test/
fallocate -l 128M img
mkfs.ext4 -b 1024 img
mount img test/
dd if=/dev/zero of=test/foo bs=1M count=128
mkdir test/dir/ && cd test/dir/
for ((i=0;i<1000;i++)); do touch file$i; done # consume all block
cd ~ && renameat2(AT_FDCWD, /dev/shm/test/dir/file1, AT_FDCWD,
/dev/shm/test/dir/dst_file, RENAME_WHITEOUT) # ext4_add_entry in
ext4_rename will return ENOSPC!!
cd /dev/shm/ && umount test/ && mount img test/ && ls -li test/dir/file1
We will get the output:
"ls: cannot access 'test/dir/file1': Structure needs cleaning"
and the dmesg show:
"EXT4-fs error (device loop0): ext4_lookup:1626: inode #2049: comm ls:
deleted inode referenced: 139"

ext4_rename will create a special inode for whiteout and use this 'ino'
to replace the source file's dir entry 'ino'. Once error happens
latter(the error above was the ENOSPC return from ext4_add_entry in
ext4_rename since all space has been consumed), the cleanup do drop the
nlink for whiteout, but forget to restore 'ino' with source file. This
will trigger the bug describle as above.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Fixes: cd808deced43 ("ext4: support RENAME_WHITEOUT")
Link: https://lore.kernel.org/r/20210105062857.3566-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a3f5cf14 16-Dec-2020 Jan Kara <jack@suse.cz>

ext4: drop ext4_handle_dirty_super()

The wrapper is now useless since it does what
ext4_handle_dirty_metadata() does. Just remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 05c2c00f 16-Dec-2020 Jan Kara <jack@suse.cz>

ext4: protect superblock modifications with a buffer lock

Protect all superblock modifications (including checksum computation)
with a superblock buffer lock. That way we are sure computed checksum
matches current superblock contents (a mismatch could cause checksum
failures in nojournal mode or if an unjournalled superblock update races
with a journalled one). Also we avoid modifying superblock contents
while it is being written out (which can cause DIF/DIX failures if we
are running in nojournal mode).

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 837c23fb 07-Nov-2020 Chunguang Xu <brookxu@tencent.com>

ext4: use ASSERT() to replace J_ASSERT()

There are currently multiple forms of assertion, such as J_ASSERT().
J_ASEERT() is provided for the jbd module, which is a public module.
Maybe we should use custom ASSERT() like other file systems, such as
xfs, which would be better.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bb9cd910 18-Nov-2020 Daniel Rosenberg <drosen@google.com>

fscrypt: Have filesystems handle their d_ops

This shifts the responsibility of setting up dentry operations from
fscrypt to the individual filesystems, allowing them to have their own
operations while still setting fscrypt's d_revalidate as appropriate.

Most filesystems can just use generic_set_encrypted_ci_d_ops, unless
they have their own specific dentry operations as well. That operation
will set the minimal d_ops required under the circumstances.

Since the fscrypt d_ops are set later on, we must set all d_ops there,
since we cannot adjust those later on. This should not result in any
change in behavior.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>


# ec0caa97 02-Dec-2020 Eric Biggers <ebiggers@google.com>

fscrypt: introduce fscrypt_prepare_readdir()

The last remaining use of fscrypt_get_encryption_info() from filesystems
is for readdir (->iterate_shared()). Every other call is now in
fs/crypto/ as part of some other higher-level operation.

We need to add a new argument to fscrypt_get_encryption_info() to
indicate whether the encryption policy is allowed to be unrecognized or
not. Doing this is easier if we can work with high-level operations
rather than direct filesystem use of fscrypt_get_encryption_info().

So add a function fscrypt_prepare_readdir() which wraps the call to
fscrypt_get_encryption_info() for the readdir use case.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 91d0d892 02-Dec-2020 Eric Biggers <ebiggers@google.com>

ext4: don't call fscrypt_get_encryption_info() from dx_show_leaf()

The call to fscrypt_get_encryption_info() in dx_show_leaf() is too low
in the call tree; fscrypt_get_encryption_info() should have already been
called when starting the directory operation. And indeed, it already
is. Moreover, the encryption key is guaranteed to already be available
because dx_show_leaf() is only called when adding a new directory entry.

And even if the key wasn't available, dx_show_leaf() uses
fscrypt_fname_disk_to_usr() which knows how to create a no-key name.

So for the above reasons, and because it would be desirable to stop
exporting fscrypt_get_encryption_info() directly to filesystems, remove
the call to fscrypt_get_encryption_info() from dx_show_leaf().

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 75d18cd1 18-Nov-2020 Eric Biggers <ebiggers@google.com>

ext4: prevent creating duplicate encrypted filenames

As described in "fscrypt: add fscrypt_is_nokey_name()", it's possible to
create a duplicate filename in an encrypted directory by creating a file
concurrently with adding the directory's encryption key.

Fix this bug on ext4 by rejecting no-key dentries in ext4_add_entry().

Note that the duplicate check in ext4_find_dest_de() sometimes prevented
this bug. However in many cases it didn't, since ext4_find_dest_de()
doesn't examine every dentry.

Fixes: 4461471107b7 ("ext4 crypto: enable filename encryption")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201118075609.120337-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>


# a80f7fcf 05-Nov-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: fixup ext4_fc_track_* functions' signature

Firstly, pass handle to all ext4_fc_track_* functions and use
transaction id found in handle->h_transaction->h_tid for tracking fast
commit updates. Secondly, don't pass inode to
ext4_fc_track_link/create/unlink functions. inode can be found inside
these functions as d_inode(dentry). However, rename path is an
exeception. That's because in that case, we need inode that's not same
as d_inode(dentry). To handle that, add a couple of low-level wrapper
functions that take inode and dentry as arguments.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f8f4acb6 27-Oct-2020 Daniel Rosenberg <drosen@google.com>

ext4: use generic casefolding support

This switches ext4 over to the generic support provided in libfs.

Since casefolded dentries behave the same in ext4 and f2fs, we decrease
the maintenance burden by unifying them, and any optimizations will
immediately apply to both.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20201028050820.1636571-1-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8016e29f 15-Oct-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: fast commit recovery path

This patch adds fast commit recovery path support for Ext4 file
system. We add several helper functions that are similar in spirit to
e2fsprogs journal recovery path handlers. Example of such functions
include - a simple block allocator, idempotent block bitmap update
function etc. Using these routines and the fast commit log in the fast
commit area, the recovery path (ext4_fc_replay()) performs fast commit
log recovery.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# aa75f4d3 15-Oct-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: main fast-commit commit path

This patch adds main fast commit commit path handlers. The overall
patch can be divided into two inter-related parts:

(A) Metadata updates tracking

This part consists of helper functions to track changes that need
to be committed during a commit operation. These updates are
maintained by Ext4 in different in-memory queues. Following are
the APIs and their short description that are implemented in this
patch:

- ext4_fc_track_link/unlink/creat() - Track unlink. link and creat
operations
- ext4_fc_track_range() - Track changed logical block offsets
inodes
- ext4_fc_track_inode() - Track inodes
- ext4_fc_mark_ineligible() - Mark file system fast commit
ineligible()
- ext4_fc_start_update() / ext4_fc_stop_update() /
ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These
functions are useful for co-ordinating inode updates with
commits.

(B) Main commit Path

This part consists of functions to convert updates tracked in
in-memory data structures into on-disk commits. Function
ext4_fc_commit() is the main entry point to commit path.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 15ed2851 26-Aug-2020 Nikolay Borisov <nborisov@suse.com>

ext4: remove unused argument from ext4_(inc|dec)_count

The 'handle' argument is not used for anything so simply remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200826133116.11592-1-nborisov@suse.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8b10fe68 10-Aug-2020 Jeff Layton <jlayton@kernel.org>

fscrypt: drop unused inode argument from fscrypt_fname_alloc_buffer

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200810142139.487631-1-jlayton@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 2fe34d29 10-Aug-2020 Kyoungho Koo <rnrudgh@gmail.com>

ext4: remove unused parameter of ext4_generic_delete_entry function

The ext4_generic_delete_entry function does not use the parameter
handle, so it can be removed.

Signed-off-by: Kyoungho Koo <rnrudgh@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200810080701.GA14160@koo-Z370-HD3
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7303cb5b 31-Jul-2020 Jan Kara <jack@suse.cz>

ext4: fix checking of directory entry validity for inline directories

ext4_search_dir() and ext4_generic_delete_entry() can be called both for
standard director blocks and for inline directories stored inside inode
or inline xattr space. For the second case we didn't call
ext4_check_dir_entry() with proper constraints that could result in
accepting corrupted directory entry as well as false positive filesystem
errors like:

EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400:
block 113246792: comm dockerd: bad entry in directory: directory entry too
close to block end - offset=0, inode=28320403, rec_len=32, name_len=8,
size=4096

Fix the arguments passed to ext4_check_dir_entry().

Fixes: 109ba779d6cc ("ext4: check for directory entries too close to block end")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e5f78159 29-Jun-2020 Yi Zhuang <zhuangyi1@huawei.com>

ext4: lost matching-pair of trace in ext4_unlink

If dquot_initialize() return non-zero and trace of ext4_unlink_enter/exit
enabled then the matching-pair of trace_exit will lost in log.

Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200629122621.129953-1-zhuangyi1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5872331b 17-Jun-2020 Eric Sandeen <sandeen@redhat.com>

ext4: fix potential negative array index in do_split()

If for any reason a directory passed to do_split() does not have enough
active entries to exceed half the size of the block, we can end up
iterating over all "count" entries without finding a split point.

In this case, count == move, and split will be zero, and we will
attempt a negative index into map[].

Guard against this by detecting this case, and falling back to
split-to-half-of-count instead; in this case we will still have
plenty of space (> half blocksize) in each split block.

Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4209ae12 26-Apr-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: handle ext4_mark_inode_dirty errors

ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
value may lead ext4 to ignore real failures that would result in
corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
as soon as possible and return errors to the caller whenever
appropriate.

One of the possible scnearios when this bug could affected is that
while creating a new inode, its directory entry gets added
successfully but while writing the inode itself mark_inode_dirty
returns error which is ignored. This would result in inconsistency
that the directory entry points to a non-existent inode.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 54d3adbc 28-Mar-2020 Theodore Ts'o <tytso@mit.edu>

ext4: save all error info in save_error_info() and drop ext4_set_errno()

Using a separate function, ext4_set_errno() to set the errno is
problematic because it doesn't do the right thing once
s_last_error_errorcode is non-zero. It's also less racy to set all of
the error information all at once. (Also, as a bonus, it shrinks code
size slightly.)

Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
Fixes: 878520ac45f9 ("ext4: save the error code which triggered...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6cfb061f 13-Feb-2020 Gustavo A. R. Silva <gustavo@embeddedor.com>

ext4: use flexible-array members in struct dx_node and struct dx_root

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200213160648.GA7054@embeddedor
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9424ef56 15-Feb-2020 Shijie Luo <luoshijie1@huawei.com>

ext4: add cond_resched() to __ext4_find_entry()

We tested a soft lockup problem in linux 4.19 which could also
be found in linux 5.x.

When dir inode takes up a large number of blocks, and if the
directory is growing when we are searching, it's possible the
restart branch could be called many times, and the do while loop
could hold cpu a long time.

Here is the call trace in linux 4.19.

[ 473.756186] Call trace:
[ 473.756196] dump_backtrace+0x0/0x198
[ 473.756199] show_stack+0x24/0x30
[ 473.756205] dump_stack+0xa4/0xcc
[ 473.756210] watchdog_timer_fn+0x300/0x3e8
[ 473.756215] __hrtimer_run_queues+0x114/0x358
[ 473.756217] hrtimer_interrupt+0x104/0x2d8
[ 473.756222] arch_timer_handler_virt+0x38/0x58
[ 473.756226] handle_percpu_devid_irq+0x90/0x248
[ 473.756231] generic_handle_irq+0x34/0x50
[ 473.756234] __handle_domain_irq+0x68/0xc0
[ 473.756236] gic_handle_irq+0x6c/0x150
[ 473.756238] el1_irq+0xb8/0x140
[ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4]
[ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4]
[ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4]
[ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4]
[ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4]
[ 473.756402] ext4_lookup+0x8c/0x258 [ext4]
[ 473.756407] __lookup_hash+0x8c/0xe8
[ 473.756411] filename_create+0xa0/0x170
[ 473.756413] do_mkdirat+0x6c/0x140
[ 473.756415] __arm64_sys_mkdirat+0x28/0x38
[ 473.756419] el0_svc_common+0x78/0x130
[ 473.756421] el0_svc_handler+0x38/0x78
[ 473.756423] el0_svc+0x8/0xc
[ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149]

Add cond_resched() to avoid soft lockup and to provide a better
system responding.

Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org


# 48a34311 10-Feb-2020 Jan Kara <jack@suse.cz>

ext4: fix checksum errors with indexed dirs

DIR_INDEX has been introduced as a compat ext4 feature. That means that
even kernels / tools that don't understand the feature may modify the
filesystem. This works because for kernels not understanding indexed dir
format, internal htree nodes appear just as empty directory entries.
Index dir aware kernels then check the htree structure is still
consistent before using the data. This all worked reasonably well until
metadata checksums were introduced. The problem is that these
effectively made DIR_INDEX only ro-compatible because internal htree
nodes store checksums in a different place than normal directory blocks.
Thus any modification ignorant to DIR_INDEX (or just clearing
EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch
and trigger kernel errors. So we have to be more careful when dealing
with indexed directories on filesystems with checksumming enabled.

1) We just disallow loading any directory inodes with EXT4_INDEX_FL when
DIR_INDEX is not enabled. This is harsh but it should be very rare (it
means someone disabled DIR_INDEX on existing filesystem and didn't run
e2fsck), e2fsck can fix the problem, and we don't want to answer the
difficult question: "Should we rather corrupt the directory more or
should we ignore that DIR_INDEX feature is not set?"

2) When we find out htree structure is corrupted (but the filesystem and
the directory should in support htrees), we continue just ignoring htree
information for reading but we refuse to add new entries to the
directory to avoid corrupting it more.

Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz
Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 64c314ff 09-Dec-2019 Eric Biggers <ebiggers@google.com>

ext4: remove unnecessary ifdefs in htree_dirblock_to_tree()

The ifdefs for CONFIG_FS_ENCRYPTION in htree_dirblock_to_tree() are
unnecessary, as the called functions are already stubbed out when
!CONFIG_FS_ENCRYPTION. Remove them.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191209213225.18477-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 46f870d6 21-Nov-2019 Theodore Ts'o <tytso@mit.edu>

ext4: simulate various I/O and checksum errors when reading metadata

This allows us to test various error handling code paths

Link: https://lore.kernel.org/r/20191209012317.59398-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 878520ac 19-Nov-2019 Theodore Ts'o <tytso@mit.edu>

ext4: save the error code which triggered an ext4_error() in the superblock

This allows the cause of an ext4_error() report to be categorized
based on whether it was triggered due to an I/O error, or an memory
allocation error, or other possible causes. Most errors are caused by
a detected file system inconsistency, so the default code stored in
the superblock will be EXT4_ERR_EFSCORRUPTED.

Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 68d7b2d8 17-Dec-2019 Yunfeng Ye <yeyunfeng@huawei.com>

ext4: fix unused-but-set-variable warning in ext4_add_entry()

Warning is found when compile with "-Wunused-but-set-variable":

fs/ext4/namei.c: In function ‘ext4_add_entry’:
fs/ext4/namei.c:2167:23: warning: variable ‘sbi’ set but not used
[-Wunused-but-set-variable]
struct ext4_sb_info *sbi;
^~~
Fix this by moving the variable @sbi under CONFIG_UNICODE.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/cb5eb904-224a-9701-c38f-cb23514b1fff@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 64d4ce89 02-Dec-2019 Jan Kara <jack@suse.cz>

ext4: fix ext4_empty_dir() for directories with holes

Function ext4_empty_dir() doesn't correctly handle directories with
holes and crashes on bh->b_data dereference when bh is NULL. Reorganize
the loop to use 'offset' variable all the times instead of comparing
pointers to current direntry with bh->b_data pointer. Also add more
strict checking of '.' and '..' directory entries to avoid entering loop
in possibly invalid state on corrupted filesystems.

References: CVE-2019-19037
CC: stable@vger.kernel.org
Fixes: 4e19d6b65fb4 ("ext4: allow directory holes")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c7df4a1e 11-Nov-2019 Theodore Ts'o <tytso@mit.edu>

ext4: work around deleting a file with i_nlink == 0 safely

If the file system is corrupted such that a file's i_links_count is
too small, then it's possible that when unlinking that file, i_nlink
will already be zero. Previously we were working around this kind of
corruption by forcing i_nlink to one; but we were doing this before
trying to delete the directory entry --- and if the file system is
corrupted enough that ext4_delete_entry() fails, then we exit with
i_nlink elevated, and this causes the orphan inode list handling to be
FUBAR'ed, such that when we unmount the file system, the orphan inode
list can get corrupted.

A better way to fix this is to simply skip trying to call drop_nlink()
if i_nlink is already zero, thus moving the check to the place where
it makes the most sense.

https://bugzilla.kernel.org/show_bug.cgi?id=205433

Link: https://lore.kernel.org/r/20191112032903.8828-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>


# 9b88f9fb 05-Nov-2019 Jan Kara <jack@suse.cz>

ext4: Do not iput inode under running transaction

When ext4_mkdir(), ext4_symlink(), ext4_create(), or ext4_mknod() fail
to add entry into directory, it ends up dropping freshly created inode
under the running transaction and thus inode truncation happens under
that transaction. That breaks assumptions that evict() does not get
called from a transaction context and at least in ext4_symlink() case it
can result in inode eviction deadlocking in inode_wait_for_writeback()
when flush worker finds symlink inode, starts to write it back and
blocks on starting a transaction. So change the code in ext4_mkdir() and
ext4_add_nondir() to drop inode reference only after the transaction is
stopped. We also have to add inode to the orphan list in that case as
otherwise the inode would get leaked in case we crash before inode
deletion is committed.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a9e26328 05-Nov-2019 Jan Kara <jack@suse.cz>

ext4: Move marking of handle as sync to ext4_add_nondir()

Every caller of ext4_add_nondir() marks handle as sync if directory has
DIRSYNC set. Move this marking to ext4_add_nondir() so reduce some
duplication.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6456ca65 02-Sep-2019 Theodore Ts'o <tytso@mit.edu>

ext4: fix kernel oops caused by spurious casefold flag

If an directory has the a casefold flag set without the casefold
feature set, s_encoding will not be initialized, and this will cause
the kernel to dereference a NULL pointer. In addition to adding
checks to avoid these kernel oops, attempts to load inodes with the
casefold flag when the casefold feature is not enable will cause the
file system to be declared corrupted.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 96fcaf86 02-Jul-2019 Gabriel Krisman Bertazi <krisman@collabora.com>

ext4: fix coverity warning on error path of filename setup

Fix the following coverity warning reported by Dan Carpenter:

fs/ext4/namei.c:1311 ext4_fname_setup_ci_filename()
warn: 'cf_name->len' unsigned <= 0

Fixes: 3ae72562ad91 ("ext4: optimize case-insensitive lookups")
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>


# 7633b08b 21-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()

Clean up namespace pollution by the inline_data code.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ddce3b94 21-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: refactor initialize_dirent_tail()

Move the calculation of the location of the dirent tail into
initialize_dirent_tail(). Also prefix the function with ext4_ to fix
kernel namepsace polution.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f036adb3 21-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: rename "dirent_csum" functions to use "dirblock"

Functions such as ext4_dirent_csum_verify() and ext4_dirent_csum_set()
don't actually operate on a directory entry, but a directory block.
And while they take a struct ext4_dir_entry *dirent as an argument, it
had better be the first directory at the beginning of the direct
block, or things will go very wrong.

Rename the following functions so that things make more sense, and
remove a lot of confusing casts along the way:

ext4_dirent_csum_verify -> ext4_dirblock_csum_verify
ext4_dirent_csum_set -> ext4_dirblock_csum_set
ext4_dirent_csum -> ext4_dirblock_csum
ext4_handle_dirty_dirent_node -> ext4_handle_dirty_dirblock

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4e19d6b6 20-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: allow directory holes

The largedir feature was intended to allow ext4 directories to have
unmapped directory blocks (e.g., directory holes). And so the
released e2fsprogs no longer enforces this for largedir file systems;
however, the corresponding change to the kernel-side code was not made.

This commit fixes this oversight.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 3ae72562 19-Jun-2019 Gabriel Krisman Bertazi <krisman@collabora.com>

ext4: optimize case-insensitive lookups

Temporarily cache a casefolded version of the file name under lookup in
ext4_filename, to avoid repeatedly casefolding it. I got up to 30%
speedup on lookups of large directories (>100k entries), depending on
the length of the string under lookup.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 08fc98a4 10-May-2019 Sahitya Tummala <stummala@codeaurora.org>

ext4: fix use-after-free in dx_release()

The buffer_head (frames[0].bh) and it's corresping page can be
potentially free'd once brelse() is done inside the for loop
but before the for loop exits in dx_release(). It can be free'd
in another context, when the page cache is flushed via
drop_caches_sysctl_handler(). This results into below data abort
when accessing info->indirect_levels in dx_release().

Unable to handle kernel paging request at virtual address ffffffc17ac3e01e
Call trace:
dx_release+0x70/0x90
ext4_htree_fill_tree+0x2d4/0x300
ext4_readdir+0x244/0x6f8
iterate_dir+0xbc/0x160
SyS_getdents64+0x94/0x174

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org


# b886ee3e 25-Apr-2019 Gabriel Krisman Bertazi <krisman@collabora.co.uk>

ext4: Support case-insensitive file name lookups

This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string. This operation is called a
case-insensitive file name lookup.

The feature is configured as an inode attribute applied to directories
and inherited by its children. This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.

* dcache handling:

For a +F directory, Ext4 only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().

d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.

DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware. The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.

* Dealing with invalid sequences:

By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file. This means that case-insensitive
file name lookup will not work only for that file. An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding. When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.

* Normalization algorithm:

The UTF-8 algorithms used to compare strings in ext4 is implemented
lives in fs/unicode, and is based on a previous version developed by
SGI. It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method for EXT4 because:

- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like
compatibility decompositions.

Although:

- This implementation is not completely linguistic accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b01531db 20-Mar-2019 Eric Biggers <ebiggers@google.com>

fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext

->lookup() in an encrypted directory begins as follows:

1. fscrypt_prepare_lookup():
a. Try to load the directory's encryption key.
b. If the key is unavailable, mark the dentry as a ciphertext name
via d_flags.
2. fscrypt_setup_filename():
a. Try to load the directory's encryption key.
b. If the key is available, encrypt the name (treated as a plaintext
name) to get the on-disk name. Otherwise decode the name
(treated as a ciphertext name) to get the on-disk name.

But if the key is concurrently added, it may be found at (2a) but not at
(1a). In this case, the dentry will be wrongly marked as a ciphertext
name even though it was actually treated as plaintext.

This will cause the dentry to be wrongly invalidated on the next lookup,
potentially causing problems. For example, if the racy ->lookup() was
part of sys_mount(), then the new mount will be detached when anything
tries to access it. This is despite the mountpoint having a plaintext
path, which should remain valid now that the key was added.

Of course, this is only possible if there's a userspace race. Still,
the additional kernel-side race is confusing and unexpected.

Close the kernel-side race by changing fscrypt_prepare_lookup() to also
set the on-disk filename (step 2b), consistent with the d_flags update.

Fixes: 28b4c263961c ("ext4 crypto: revalidate dentry after adding or removing the key")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 643fa961 12-Dec-2018 Chandan Rajendra <chandan@linux.vnet.ibm.com>

fscrypt: remove filesystem specific build config option

In order to have a common code base for fscrypt "post read" processing
for all filesystems which support encryption, this commit removes
filesystem specific build config option (e.g. CONFIG_EXT4_FS_ENCRYPTION)
and replaces it with a build option (i.e. CONFIG_FS_ENCRYPTION) whose
value affects all the filesystems making use of fscrypt.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 592ddec7 12-Dec-2018 Chandan Rajendra <chandan@linux.vnet.ibm.com>

ext4: use IS_ENCRYPTED() to check encryption status

This commit removes the ext4 specific ext4_encrypted_inode() and makes
use of the generic IS_ENCRYPTED() macro to check for the encryption
status of an inode.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 8a363970 18-Dec-2018 Theodore Ts'o <tytso@mit.edu>

ext4: avoid declaring fs inconsistent due to invalid file handles

If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.

This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.

#define _GNU_SOURCE
#include <fcntl.h>

struct handle {
struct file_handle fh;
unsigned char fid[MAX_HANDLE_SZ];
};

int main(int argc, char **argv)
{
struct handle h = {{8, 1 }, { 12, }};

open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
return 0;
}

Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# de59fae0 07-Nov-2018 Vasily Averin <vvs@virtuozzo.com>

ext4: fix buffer leak in __ext4_read_dirblock() on error path

Fixes: dc6982ff4db1 ("ext4: refactor code to read directory blocks ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.9


# feaf264c 06-Nov-2018 Vasily Averin <vvs@virtuozzo.com>

ext4: avoid buffer leak in ext4_orphan_add() after prior errors

Fixes: d745a8c20c1f ("ext4: reduce contention on s_orphan_lock")
Fixes: 6e3617e579e0 ("ext4: Handle non empty on-disk orphan link")
Cc: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.34


# e884bce1 10-Oct-2018 Al Viro <viro@zeniv.linux.org.uk>

ext4: don't open-code ERR_CAST

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 799578ab 01-Oct-2018 Gabriel Krisman Bertazi <krisman@collabora.co.uk>

ext4: fix build error when DX_DEBUG is defined

Enabling DX_DEBUG triggers the build error below. info is an attribute
of the dxroot structure.

linux/fs/ext4/namei.c:2264:12: error: ‘info’
undeclared (first use in this function); did you mean ‘insl’?
info->indirect_levels));

Fixes: e08ac99fa2a2 ("ext4: add largedir feature")
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>


# b50282f3 26-Aug-2018 Theodore Ts'o <tytso@mit.edu>

ext4: check to make sure the rename(2)'s destination is not freed

If the destination of the rename(2) system call exists, the inode's
link count (i_nlinks) must be non-zero. If it is, the inode can end
up on the orphan list prematurely, leading to all sorts of hilarity,
including a use-after-free.

https://bugzilla.kernel.org/show_bug.cgi?id=200931

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org


# f39b3f45 29-Jul-2018 Eric Sandeen <sandeen@redhat.com>

ext4: reset error code in ext4_find_entry in fallback

When ext4_find_entry() falls back to "searching the old fashioned
way" due to a corrupt dx dir, it needs to reset the error code
to NULL so that the nonstandard ERR_BAD_DX_DIR code isn't returned
to userspace.

https://bugzilla.kernel.org/show_bug.cgi?id=199947

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@yandex.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 95582b00 08-May-2018 Deepa Dinamani <deepa.kernel@gmail.com>

vfs: change inode times to use struct timespec64

struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}

@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}

@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>


# 1e2e547a 04-May-2018 Al Viro <viro@zeniv.linux.org.uk>

do d_instantiate/unlock_new_inode combinations safely

For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex. Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.

Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode(). All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.

Cc: stable@kernel.org # 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ee73f9a5 09-Jan-2018 Jeff Layton <jlayton@kernel.org>

ext4: convert to new i_version API

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>


# ae5e165d 29-Jan-2018 Jeff Layton <jlayton@kernel.org>

fs: new API for handling inode->i_version

Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>


# 78e1060c 11-Jan-2018 Eric Biggers <ebiggers@google.com>

ext4: switch to fscrypt ->symlink() helper functions

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a794df0e 11-Jan-2018 Colin Ian King <colin.king@canonical.com>

ext4: fix incorrect indentation of if statement

The indentation is incorrect and spaces need replacing with a tab
on the if statement.

Cleans up smatch warning:
fs/ext4/namei.c:3220 ext4_link() warn: inconsistent indenting

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 9d5afec6 11-Dec-2017 Chandan Rajendra <chandan@linux.vnet.ibm.com>

ext4: fix crash when a directory's i_size is too small

On a ppc64 machine, when mounting a fuzzed ext2 image (generated by
fsfuzzer) the following call trace is seen,

VFS: brelse: Trying to free free buffer
WARNING: CPU: 1 PID: 6913 at /root/repos/linux/fs/buffer.c:1165 .__brelse.part.6+0x24/0x40
.__brelse.part.6+0x20/0x40 (unreliable)
.ext4_find_entry+0x384/0x4f0
.ext4_lookup+0x84/0x250
.lookup_slow+0xdc/0x230
.walk_component+0x268/0x400
.path_lookupat+0xec/0x2d0
.filename_lookup+0x9c/0x1d0
.vfs_statx+0x98/0x140
.SyS_newfstatat+0x48/0x80
system_call+0x58/0x6c

This happens because the directory that ext4_find_entry() looks up has
inode->i_size that is less than the block size of the filesystem. This
causes 'nblocks' to have a value of zero. ext4_bread_batch() ends up not
reading any of the directory file's blocks. This renders the entries in
bh_use[] array to continue to have garbage data. buffer_uptodate() on
bh_use[0] can then return a zero value upon which brelse() function is
invoked.

This commit fixes the bug by returning -ENOENT when the directory file
has no associated blocks.

Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 89904275 18-Oct-2017 Eric Biggers <ebiggers@google.com>

ext4: switch to fscrypt_prepare_lookup()

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 07543d16 18-Oct-2017 Eric Biggers <ebiggers@google.com>

ext4: switch to fscrypt_prepare_rename()

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 69725181 18-Oct-2017 Eric Biggers <ebiggers@google.com>

ext4: switch to fscrypt_prepare_link()

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9699d4f9 05-Aug-2017 Tahsin Erdogan <tahsin@google.com>

ext4: make xattr inode reads faster

ext4_xattr_inode_read() currently reads each block sequentially while
waiting for io operation to complete before moving on to the next
block. This prevents request merging in block layer.

Add a ext4_bread_batch() function that starts reads for all blocks
then optionally waits for them to complete. A similar logic is used
in ext4_find_entry(), so update that code to use the new function.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c7414892 05-Aug-2017 Andreas Dilger <adilger@dilger.ca>

ext4: fix dir_nlink behaviour

The dir_nlink feature has been enabled by default for new ext4
filesystems since e2fsprogs-1.41 in 2008, and was automatically
enabled by the kernel for older ext4 filesystems since the
dir_nlink feature was added with ext4 in kernel 2.6.28+ when
the subdirectory count exceeded EXT4_LINK_MAX-1.

Automatically adding the file system features such as dir_nlink is
generally frowned upon, since it could cause the file system to not be
mountable on older kernel, thus preventing the administrator from
rolling back to an older kernel if necessary.

In this case, the administrator might also want to disable the feature
because glibc's fts_read() function does not correctly optimize
directory traversal for directories that use st_nlinks field of 1 to
indicate that the number of links in the directory are not tracked by
the file system, and could fail to traverse the full directory
hierarchy. Fortunately, in the past ten years very few users have
complained about incomplete file system traversal by glibc's
fts_read().

This commit also changes ext4_inc_count() to allow i_nlinks to reach
the full EXT4_LINK_MAX links on the parent directory (including "."
and "..") before changing i_links_count to be 1.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196405
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bdddf342 22-Jun-2017 Theodore Ts'o <tytso@mit.edu>

ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()

Previously a bad directory block with a bad checksum is skipped; we
should be returning EFSBADCRC (aka EBADMSG).

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6febe6f2 22-Jun-2017 Khazhismel Kumykov <khazhy@google.com>

ext4: return EIO on read error in ext4_find_entry

Previously, a read error would be ignored and we would eventually return
NULL from ext4_find_entry, which signals "no such file or directory". We
should be returning EIO.

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>


# e08ac99f 21-Jun-2017 Artem Blagodarenko <artem.blagodarenko@gmail.com>

ext4: add largedir feature

This INCOMPAT_LARGEDIR feature allows larger directories to be created
in ldiskfs, both with directory sizes over 2GB and and a maximum htree
depth of 3 instead of the current limit of 2. These features are needed
in order to exceed the current limit of approximately 10M entries in a
single directory.

This patch was originally written by Yang Sheng to support the Lustre server.

[ Bumped the credits needed to update an indexed directory -- tytso ]

Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Yang Sheng <yang.sheng@intel.com>
Signed-off-by: Artem Blagodarenko <artem.blagodarenko@seagate.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>


# d6b97550 24-May-2017 Eric Biggers <ebiggers@google.com>

ext4: remove unused d_name argument from ext4_search_dir() et al.

Now that we are passing a struct ext4_filename, we do not need to pass
around the original struct qstr too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# d9b9f8d5 24-Apr-2017 Eric Biggers <ebiggers@google.com>

ext4: clean up ext4_match() and callers

When ext4 encryption was originally merged, we were encrypting the
user-specified filename in ext4_match(), introducing a lot of additional
complexity into ext4_match() and its callers. This has since been
changed to encrypt the filename earlier, so we can remove the gunk
that's no longer needed. This more or less reverts ext4_search_dir()
and ext4_find_dest_de() to the way they were in the v4.0 kernel.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 067d1023 24-Apr-2017 Eric Biggers <ebiggers@google.com>

ext4: switch to using fscrypt_match_name()

Switch ext4 directory searches to use the fscrypt_match_name() helper
function. There should be no functional change.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6b06cdee 24-Apr-2017 Eric Biggers <ebiggers@google.com>

fscrypt: avoid collisions when presenting long encrypted filenames

When accessing an encrypted directory without the key, userspace must
operate on filenames derived from the ciphertext names, which contain
arbitrary bytes. Since we must support filenames as long as NAME_MAX,
we can't always just base64-encode the ciphertext, since that may make
it too long. Currently, this is solved by presenting long names in an
abbreviated form containing any needed filesystem-specific hashes (e.g.
to identify a directory block), then the last 16 bytes of ciphertext.
This needs to be sufficient to identify the actual name on lookup.

However, there is a bug. It seems to have been assumed that due to the
use of a CBC (ciphertext block chaining)-based encryption mode, the last
16 bytes (i.e. the AES block size) of ciphertext would depend on the
full plaintext, preventing collisions. However, we actually use CBC
with ciphertext stealing (CTS), which handles the last two blocks
specially, causing them to appear "flipped". Thus, it's actually the
second-to-last block which depends on the full plaintext.

This caused long filenames that differ only near the end of their
plaintexts to, when observed without the key, point to the wrong inode
and be undeletable. For example, with ext4:

# echo pass | e4crypt add_key -p 16 edir/
# seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
100000
# sync
# echo 3 > /proc/sys/vm/drop_caches
# keyctl new_session
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
2004
# rm -rf edir/
rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
...

To fix this, when presenting long encrypted filenames, encode the
second-to-last block of ciphertext rather than the last 16 bytes.

Although it would be nice to solve this without depending on a specific
encryption mode, that would mean doing a cryptographic hash like SHA-256
which would be much less efficient. This way is sufficient for now, and
it's still compatible with encryption modes like HEH which are strong
pseudorandom permutations. Also, changing the presented names is still
allowed at any time because they are only provided to allow applications
to do things like delete encrypted directories. They're not designed to
be used to persistently identify files --- which would be hard to do
anyway, given that they're encrypted after all.

For ease of backports, this patch only makes the minimal fix to both
ext4 and f2fs. It leaves ubifs as-is, since ubifs doesn't compare the
ciphertext block yet. Follow-on patches will clean things up properly
and make the filesystems use a shared helper function.

Fixes: 5de0b4d0cd15 ("ext4 crypto: simplify and speed up filename encryption")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8c68084b 07-Apr-2017 Eric Biggers <ebiggers@google.com>

ext4: remove "nokey" check from ext4_lookup()

Now that fscrypt_has_permitted_context() correctly handles the case
where we have the key for the parent directory but not the child, we
don't need to try to work around this in ext4_lookup().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1bc0af60 29-Apr-2017 Eric Biggers <ebiggers@google.com>

ext4: trim return value and 'dir' argument from ext4_insert_dentry()

In the initial implementation of ext4 encryption, the filename was
encrypted in ext4_insert_dentry(), which could fail and also required
access to the 'dir' inode. Since then ext4 filename encryption has been
changed to encrypt the filename earlier, so we can revert the additions
to ext4_insert_dentry().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 99652ea5 31-Mar-2017 David Howells <dhowells@redhat.com>

ext4: Add statx support

Return enhanced file attributes from the Ext4 filesystem. This includes
the following:

(1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.

(2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.

This requires that all ext4 inodes have a getattr call, not just some of
them, so to this end, split the ext4_getattr() function and only call part
of it where appropriate.

Example output:

[root@andromeda ~]# touch foo
[root@andromeda ~]# chattr +ai foo
[root@andromeda ~]# /tmp/test-statx foo
statx(foo) = 0
results=fff
Size: 0 Blocks: 0 IO Block: 4096 regular file
Device: 08:12 Inode: 2101950 Links: 1
Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
Access: 2016-02-11 17:08:29.031795451+0000
Modify: 2016-02-11 17:08:29.031795451+0000
Change: 2016-02-11 17:11:11.987790114+0000
Birth: 2016-02-11 17:08:29.031795451+0000
Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0db1ff22 04-Feb-2017 Theodore Ts'o <tytso@mit.edu>

ext4: add shutdown bit and check for it

Add a shutdown bit that will cause ext4 processing to fail immediately
with EIO.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# dd01b690 01-Feb-2017 Eric Biggers <ebiggers@google.com>

ext4: fix use-after-iput when fscrypt contexts are inconsistent

In the case where the child's encryption context was inconsistent with
its parent directory, we were using inode->i_sb and inode->i_ino after
the inode had already been iput(). Fix this by doing the iput() in the
correct places.

Note: only ext4 had this bug, not f2fs and ubifs.

Fixes: d9cdc9033181 ("ext4 crypto: enforce context consistency")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 173b8439 27-Dec-2016 Theodore Ts'o <tytso@mit.edu>

ext4: don't allow encrypted operations without keys

While we allow deletes without the key, the following should not be
permitted:

# cd /vdc/encrypted-dir-without-key
# ls -l
total 4
-rw-r--r-- 1 root root 0 Dec 27 22:35 6,LKNRJsp209FbXoSvJWzB
-rw-r--r-- 1 root root 286 Dec 27 22:35 uRJ5vJh9gE7vcomYMqTAyD
# mv uRJ5vJh9gE7vcomYMqTAyD 6,LKNRJsp209FbXoSvJWzB

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 54475f53 05-Dec-2016 Eric Biggers <ebiggers@google.com>

fscrypt: use ENOKEY when file cannot be created w/o key

As part of an effort to clean up fscrypt-related error codes, make
attempting to create a file in an encrypted directory that hasn't been
"unlocked" fail with ENOKEY. Previously, several error codes were used
for this case, including ENOENT, EACCES, and EPERM, and they were not
consistent between and within filesystems. ENOKEY is a better choice
because it expresses that the failure is due to lacking the encryption
key. It also matches the error code returned when trying to open an
encrypted regular file without the key.

I am not aware of any users who might be relying on the previous
inconsistent error codes, which were never documented anywhere.

This failure case will be exercised by an xfstest.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# eeca7ea1 14-Nov-2016 Deepa Dinamani <deepa.kernel@gmail.com>

ext4: use current_time() for inode timestamps

CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.

current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().

Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>


# d74f3d25 15-Oct-2016 Joe Perches <joe@perches.com>

ext4: add missing KERN_CONT to a few more debugging uses

Recent commits require line continuing printks to always use
pr_cont or KERN_CONT. Add these markings to a few more printks.

Miscellaneaous:

o Integrate the ea_idebug and ea_bdebug macros to use a single
call to printk(KERN_DEBUG instead of 3 separate printks
o Use the more common varargs macro style

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>


# fd50ecad 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

vfs: Remove {get,set,remove}xattr inode operations

These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e81d4477 29-Sep-2016 gmail <yngsion@gmail.com>

ext4: release bh in make_indexed_dir

The commit 6050d47adcad: "ext4: bail out from make_indexed_dir() on
first error" could end up leaking bh2 in the error path.

[ Also avoid renaming bh2 to bh, which just confuses things --tytso ]

Cc: stable@vger.kernel.org
Signed-off-by: yangsheng <yngsion@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2773bf00 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

fs: rename "rename2" i_op to "rename"

Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ef1eb3aa 15-Sep-2016 Eric Biggers <ebiggers@google.com>

fscrypto: make filename crypto functions return 0 on success

Several filename crypto functions: fname_decrypt(),
fscrypt_fname_disk_to_usr(), and fscrypt_fname_usr_to_disk(), returned
the output length on success or -errno on failure. However, the output
length was redundant with the value written to 'oname->len'. It is also
potentially error-prone to make callers have to check for '< 0' instead
of '!= 0'.

Therefore, make these functions return 0 instead of a length, and make
the callers who cared about the return value being a length use
'oname->len' instead. For consistency also make other callers check for
a nonzero result rather than a negative result.

This change also fixes the inconsistency of fname_encrypt() actually
already returning 0 on success, not a length like the other filename
crypto functions and as documented in its function comment.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>


# a7550b30 10-Jul-2016 Jaegeuk Kim <jaegeuk@kernel.org>

ext4 crypto: migrate into vfs's crypto engine

This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# fa964540 03-Jul-2016 Daeho Jeong <daeho.jeong@samsung.com>

ext4: correct error value of function verifying dx checksum

ext4_dx_csum_verify() returns the success return value in two checksum
verification failure cases. We need to set the return values to zero
as failure like ext4_dirent_csum_verify() returning zero when failing
to find a checksum dirent at the tail.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# b47820ed 03-Jul-2016 Daeho Jeong <daeho.jeong@samsung.com>

ext4: avoid modifying checksum fields directly during checksum verification

We temporally change checksum fields in buffers of some types of
metadata into '0' for verifying the checksum values. By doing this
without locking the buffer, some metadata's checksums, which are
being committed or written back to the storage, could be damaged.
In our test, several metadata blocks were found with damaged metadata
checksum value during recovery process. When we only verify the
checksum value, we have to avoid modifying checksum fields directly.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# dfec8a14 05-Jun-2016 Mike Christie <mchristi@redhat.com>

fs: have ll_rw_block users pass in op and flags separately

This has ll_rw_block users pass in the operation and flags separately,
so ll_rw_block can setup the bio op and bi_rw flags on the bio that
is submitted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 74177f55 05-May-2016 Jan Kara <jack@suse.cz>

ext4: fix oops on corrupted filesystem

When filesystem is corrupted in the right way, it can happen
ext4_mark_iloc_dirty() in ext4_orphan_add() returns error and we
subsequently remove inode from the in-memory orphan list. However this
deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
leave i_orphan list_head with a stale content. Later we can look at this
content causing list corruption, oops, or other issues. The reported
trace looked like:

WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
0000000000100100)
CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
Stack:
60462947 62219960 602ede24 62219960
602ede24 603ca293 622198f0 602f02eb
62219950 6002c12c 62219900 601b4d6b
Call Trace:
[<6005769c>] ? vprintk_emit+0x2dc/0x5c0
[<602ede24>] ? printk+0x0/0x94
[<600190bc>] show_stack+0xdc/0x1a0
[<602ede24>] ? printk+0x0/0x94
[<602ede24>] ? printk+0x0/0x94
[<602f02eb>] dump_stack+0x2a/0x2c
[<6002c12c>] warn_slowpath_common+0x9c/0xf0
[<601b4d6b>] ? __list_del_entry+0x6b/0x100
[<6002c254>] warn_slowpath_fmt+0x94/0xa0
[<602f4d09>] ? __mutex_lock_slowpath+0x239/0x3a0
[<6002c1c0>] ? warn_slowpath_fmt+0x0/0xa0
[<60023ebf>] ? set_signals+0x3f/0x50
[<600a205a>] ? kmem_cache_free+0x10a/0x180
[<602f4e88>] ? mutex_lock+0x18/0x30
[<601b4d6b>] __list_del_entry+0x6b/0x100
[<601177ec>] ext4_orphan_del+0x22c/0x2f0
[<6012f27c>] ? __ext4_journal_start_sb+0x2c/0xa0
[<6010b973>] ? ext4_truncate+0x383/0x390
[<6010bc8b>] ext4_write_begin+0x30b/0x4b0
[<6001bb50>] ? copy_from_user+0x0/0xb0
[<601aa840>] ? iov_iter_fault_in_readable+0xa0/0xc0
[<60072c4f>] generic_perform_write+0xaf/0x1e0
[<600c4166>] ? file_update_time+0x46/0x110
[<60072f0f>] __generic_file_write_iter+0x18f/0x1b0
[<6010030f>] ext4_file_write_iter+0x15f/0x470
[<60094e10>] ? unlink_file_vma+0x0/0x70
[<6009b180>] ? unlink_anon_vmas+0x0/0x260
[<6008f169>] ? free_pgtables+0xb9/0x100
[<600a6030>] __vfs_write+0xb0/0x130
[<600a61d5>] vfs_write+0xa5/0x170
[<600a63d6>] SyS_write+0x56/0xe0
[<6029fcb0>] ? __libc_waitpid+0x0/0xa0
[<6001b698>] handle_syscall+0x68/0x90
[<6002633d>] userspace+0x4fd/0x600
[<6002274f>] ? save_registers+0x1f/0x40
[<60028bd7>] ? arch_prctl+0x177/0x1b0
[<60017bd5>] fork_handler+0x85/0x90

Fix the problem by using list_del_init() as we always should with
i_orphan list.

CC: stable@vger.kernel.org
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8d2ae1cb 26-Apr-2016 Jakub Wilk <jwilk@jwilk.net>

ext4: remove trailing \n from ext4_warning/ext4_error calls

Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>


# 1f60fbe7 23-Apr-2016 Theodore Ts'o <tytso@mit.edu>

ext4: allow readdir()'s of large empty directories to be interrupted

If a directory has a large number of empty blocks, iterating over all
of them can take a long time, leading to scheduler warnings and users
getting irritated when they can't kill a process in the middle of one
of these long-running readdir operations. Fix this by adding checks to
ext4_readdir() and ext4_htree_fill_tree().

This was reverted earlier due to a typo in the original commit where I
experimented with using signal_pending() instead of
fatal_signal_pending(). The test was in the wrong place if we were
going to return signal_pending() since we would end up returning
duplicant entries. See 9f2394c9be47 for a more detailed explanation.

Added fix as suggested by Linus to check for signal_pending() in
in the filldir() functions.

Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Google-Bug-Id: 27880676
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9f2394c9 10-Apr-2016 Linus Torvalds <torvalds@linux-foundation.org>

Revert "ext4: allow readdir()'s of large empty directories to be interrupted"

This reverts commit 1028b55bafb7611dda1d8fed2aeca16a436b7dff.

It's broken: it makes ext4 return an error at an invalid point, causing
the readdir wrappers to write the the position of the last successful
directory entry into the position field, which means that the next
readdir will now return that last successful entry _again_.

You can only return fatal errors (that terminate the readdir directory
walk) from within the filesystem readdir functions, the "normal" errors
(that happen when the readdir buffer fills up, for example) happen in
the iterorator where we know the position of the actual failing entry.

I do have a very different patch that does the "signal_pending()"
handling inside the iterator function where it is allowable, but while
that one passes all the sanity checks, I screwed up something like four
times while emailing it out, so I'm not going to commit it today.

So my track record is not good enough, and the stars will have to align
better before that one gets committed. And it would be good to get some
review too, of course, since celestial alignments are always an iffy
debugging model.

IOW, let's just revert the commit that caused the problem for now.

Reported-by: Greg Thelen <gthelen@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fc64005c 09-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

don't bother with ->d_inode->i_sb - it's always equal to ->d_sb

... and neither can ever be NULL

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1028b55b 30-Mar-2016 Theodore Ts'o <tytso@mit.edu>

ext4: allow readdir()'s of large empty directories to be interrupted

If a directory has a large number of empty blocks, iterating over all
of them can take a long time, leading to scheduler warnings and users
getting irritated when they can't kill a process in the middle of one
of these long-running readdir operations. Fix this by adding checks to
ext4_readdir() and ext4_htree_fill_tree().

Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Google-Bug-Id: 27880676
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ff978b09 07-Feb-2016 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: move context consistency check to ext4_file_open()

In the case where the per-file key for the directory is cached, but
root does not have access to the key needed to derive the per-file key
for the files in the directory, we allow the lookup to succeed, so
that lstat(2) and unlink(2) can suceed. However, if a program tries
to open the file, it will get an ENOKEY error.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 28b4c263 07-Feb-2016 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: revalidate dentry after adding or removing the key

Add a validation check for dentries for encrypted directory to make
sure we're not caching stale data after a key has been added or removed.

Also check to make sure that status of the encryption key is updated
when readdir(2) is executed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 040cb378 08-Jan-2016 Li Xi <pkuelelixi@gmail.com>

ext4: adds project ID support

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>


# 56a04915 08-Jan-2016 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: simplify interfaces to directory entry insert functions

A number of functions include ext4_add_dx_entry, make_indexed_dir,
etc. are being passed a dentry even though the only thing they use is
the containing parent. We can shrink the code size slightly by making
this replacement. This will also be useful in cases where we don't
have a dentry as the argument to the directory entry insert functions.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 21fc61c7 16-Nov-2015 Al Viro <viro@zeniv.linux.org.uk>

don't put symlink bodies in pagecache into highmem

kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.

new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases. page_follow_link_light()
instrumented to yell about anything missed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# be69e1c1 29-Oct-2015 Yaowei Bai <bywxiaobai@163.com>

fs/ext4: remove unnecessary new_valid_dev check

As new_valid_dev always returns 1, so !new_valid_dev check is not
needed, remove it.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e2b911c5 17-Oct-2015 Darrick J. Wong <darrick.wong@oracle.com>

ext4: clean up feature test macros with predicate functions

Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros. Furthermore, clean out the
places where we open-coded feature tests.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 6a797d27 17-Oct-2015 Darrick J. Wong <darrick.wong@oracle.com>

ext4: call out CRC and corruption errors with specific error codes

Instead of overloading EIO for CRC errors and corrupt structures,
return the same error codes that XFS returns for the same issues.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a1c83681 12-Aug-2015 Viresh Kumar <viresh.kumar@linaro.org>

fs: Drop unlikely before IS_ERR(_OR_NULL)

IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there
is no need to do that again from its callers. Drop it.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 926631c2 24-Jul-2015 Dan Carpenter <dan.carpenter@oracle.com>

ext4: memory leak on error in ext4_symlink()

We should release "sd" before returning.

Fixes: 0fa12ad1b285 ('ext4: Handle error from dquot_initialize()')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.com>


# a7cdadee 29-Jun-2015 Jan Kara <jack@suse.com>

ext4: Handle error from dquot_initialize()

dquot_initialize() can now return error. Handle it where possible.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.com>


# c5e298ae 20-Jun-2015 Theodore Ts'o <tytso@mit.edu>

ext4: prevent ext4_quota_write() from failing due to ENOSPC

In order to prevent quota block tracking to be inaccurate when
ext4_quota_write() fails with ENOSPC, we make two changes. The quota
file can now use the reserved block (since the quota file is arguably
file system metadata), and ext4_quota_write() now uses
ext4_should_retry_alloc() to retry the block allocation after a commit
has completed and released some blocks for allocation.

This fixes failures of xfstests generic/270:

Quota error (device vdc): write_blk: dquota write failed
Quota error (device vdc): qtree_write_dquot: Error -28 occurred while creating quota

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b03a2f7e 15-Jun-2015 Andreas Dilger <adilger@dilger.ca>

ext4: improve warning directory handling messages

Several ext4_warning() messages in the directory handling code do not
report the inode number of the (potentially corrupt) directory where a
problem is seen, and others report this in an ad-hoc manner. Add an
ext4_warning_inode() helper to print the inode number and command name
consistent with ext4_error_inode().

Consolidate the place in ext4.h that these macros are defined.

Clean up some other directory error and warning messages to print the
calling function name.

Minor code style fixes in nearby lines.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4d3c4e5b 31-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: allocate the right amount of memory for the on-disk symlink

Previously we were taking the required padding when allocating space
for the on-disk symlink. This caused a buffer overrun which could
trigger a krenel crash when running fsstress.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c2faccaf 31-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: enforce crypto policy restrictions on cross-renames

Thanks to Chao Yu <chao2.yu@samsung.com> for pointing out the need for
this check.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e709e9df 31-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: encrypt tmpfile located in encryption protected directory

Factor out calls to ext4_inherit_context() and move them to
__ext4_new_inode(); this fixes a problem where ext4_tmpfile() wasn't
calling calling ext4_inherit_context(), so the temporary file wasn't
getting protected. Since the blocks for the tmpfile could end up on
disk, they really should be protected if the tmpfile is created within
the context of an encrypted directory.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c936e1ec 31-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: use per-inode tfm structure

As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
we read or write a page. Instead we can use a single tfm hanging off
the inode's crypt_info structure for all of our encryption needs for
that inode, since the tfm can be used by multiple crypto requests in
parallel.

Also use cmpxchg() to avoid races that could result in crypt_info
structure getting doubly allocated or doubly freed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b7236e21 18-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: reorganize how we store keys in the inode

This is a pretty massive patch which does a number of different things:

1) The per-inode encryption information is now stored in an allocated
data structure, ext4_crypt_info, instead of directly in the node.
This reduces the size usage of an in-memory inode when it is not
using encryption.

2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
encryption structure instead. This remove an unnecessary memory
allocation and free for the fname_crypto_ctx as well as allowing us
to reuse the ctfm in a directory for multiple lookups and file
creations.

3) We also cache the inode's policy information in the ext4_crypt_info
structure so we don't have to continually read it out of the
extended attributes.

4) We now keep the keyring key in the inode's encryption structure
instead of releasing it after we are done using it to derive the
per-inode key. This allows us to test to see if the key has been
revoked; if it has, we prevent the use of the derived key and free
it.

5) When an inode is released (or when the derived key is freed), we
will use memset_explicit() to zero out the derived key, so it's not
left hanging around in memory. This implies that when a user logs
out, it is important to first revoke the key, and then unlink it,
and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
release any decrypted pages and dcache entries from the system
caches.

6) All this, and we also shrink the number of lines of code by around
100. :-)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d2299590 18-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: don't allocate a page when encrypting/decrypting file names

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5b643f9c 18-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: optimize filename encryption

Encrypt the filename as soon it is passed in by the user. This avoids
our needing to encrypt the filename 2 or 3 times while in the process
of creating a filename.

Similarly, when looking up a directory entry, encrypt the filename
early, or if the encryption key is not available, base-64 decode the
file syystem so that the hash value and the last 16 bytes of the
encrypted filename is available in the new struct ext4_filename data
structure.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 75e7566b 02-May-2015 Al Viro <viro@zeniv.linux.org.uk>

ext4: switch to simple_follow_link()

for fast symlinks only, of course...

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a7a67e8a 27-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

ext4: split inode_operations for encrypted symlinks off the rest

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5de0b4d0 01-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: simplify and speed up filename encryption

Avoid using SHA-1 when calculating the user-visible filename when the
encryption key is available, and avoid decrypting lots of filenames
when searching for a directory entry in a directory block.

Change-Id: If4655f144784978ba0305b597bfa1c8d7bb69e63
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6ddb2447 15-Apr-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: enable encryption feature flag

Also add the test dummy encryption mode flag so we can more easily
test the encryption patches using xfstests.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f348c252 15-Apr-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: add symlink encryption

Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# be64f884 15-Apr-2015 Boaz Harrosh <boaz@plexistor.com>

dax: unify ext2/4_{dax,}_file_operations

The original dax patchset split the ext2/4_file_operations because of the
two NULL splice_read/splice_write in the dax case.

In the vfs if splice_read/splice_write are NULL we then call
default_splice_read/write.

What we do here is make generic_file_splice_read aware of IS_DAX() so the
original ext2/4_file_operations can be used as is.

For write it appears that iter_file_splice_write is just fine. It uses
the regular f_op->write(file,..) or new_sync_write(file, ...).

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2b0143b5 17-Mar-2015 David Howells <dhowells@redhat.com>

VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1f3862b5 11-Apr-2015 Michael Halcrow <mhalcrow@google.com>

ext4 crypto: filename encryption modifications

Modifies htree_dirblock_to_tree, dx_make_map, ext4_match search_dir,
and ext4_find_dest_de to support fname crypto. Filename encryption
feature is not yet enabled at this patch.

Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b3098486 11-Apr-2015 Michael Halcrow <mhalcrow@google.com>

ext4 crypto: partial update to namei.c for fname crypto

Modifies dx_show_leaf and dx_probe to support fname encryption.
Filename encryption not yet enabled.

Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4bdfc873 11-Apr-2015 Michael Halcrow <mhalcrow@google.com>

ext4 crypto: insert encrypted filenames into a leaf directory block

Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2f61830a 11-Apr-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames

For encrypted directories, we need to pass in a separate parameter for
the decrypted filename, since the directory entry contains the
encrypted filename.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# dde680ce 11-Apr-2015 Michael Halcrow <mhalcrow@google.com>

ext4 crypto: inherit encryption policies on inode and directory create

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d9cdc903 11-Apr-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: enforce context consistency

Enforce the following inheritance policy:

1) An unencrypted directory may contain encrypted or unencrypted files
or directories.

2) All files or directories in a directory must be protected using the
same key as their containing directory.

As a result, assuming the following setup:

mke2fs -t ext4 -Fq -O encrypt /dev/vdc
mount -t ext4 /dev/vdc /vdc
mkdir /vdc/a /vdc/b /vdc/c
echo foo | e4crypt add_key /vdc/a
echo bar | e4crypt add_key /vdc/b
for i in a b c ; do cp /etc/motd /vdc/$i/motd-$i ; done

Then we will see the following results:

cd /vdc
mv a b # will fail; /vdc/a and /vdc/b have different keys
mv b/motd-b a # will fail, see above
ln a/motd-a b # will fail, see above
mv c a # will fail; all inodes in an encrypted directory
# must be encrypted
ln c/motd-c b # will fail, see above
mv a/motd-a c # will succeed
mv c/motd-a a # will succeed

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e875a2dd 11-Apr-2015 Michael Halcrow <mhalcrow@google.com>

ext4 crypto: export ext4_empty_dir()

Required for future encryption xattr changes.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e12fb972 03-Apr-2015 Lukas Czerner <lczerner@redhat.com>

ext4: make fsync to sync parent dir in no-journal for real this time

Previously commit 14ece1028b3ed53ffec1b1213ffc6acaf79ad77c added a
support for for syncing parent directory of newly created inodes to
make sure that the inode is not lost after a power failure in
no-journal mode.

However this does not work in majority of cases, namely:
- if the directory has inline data
- if the directory is already indexed
- if the directory already has at least one block and:
- the new entry fits into it
- or we've successfully converted it to indexed

So in those cases we might lose the inode entirely even after fsync in
the no-journal mode. This also includes ext2 default mode obviously.

I've noticed this while running xfstest generic/321 and even though the
test should fail (we need to run fsck after a crash in no-journal mode)
I could not find a newly created entries even when if it was fsynced
before.

Fix this by adjusting the ext4_add_entry() successful exit paths to set
the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the
parent directory as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: stable@vger.kernel.org


# 72b8e0f9 02-Apr-2015 Sheng Yong <shengyong1@huawei.com>

ext4: remove unused header files

Remove unused header files and header files which are included in
ext4.h.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7071b715 02-Apr-2015 Konstantin Khlebnikov <koct9i@gmail.com>

ext4: fix bh leak on error paths in ext4_rename() and ext4_cross_rename()

Release references to buffer-heads if ext4_journal_start() fails.

Fixes: 5b61de757535 ("ext4: start handle at least possible moment when renaming files")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 923ae0ff 16-Feb-2015 Ross Zwisler <zwisler@kernel.org>

ext4: add DAX functionality

This is a port of the DAX functionality found in the current version of
ext2.

[matthew.r.wilcox@intel.com: heavily tweaked]
[akpm@linux-foundation.org: remap_pages went away]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31fc006b 25-Nov-2014 Namjae Jeon <namjae.jeon@samsung.com>

ext4: remove unneeded code in ext4_unlink

Setting retval to zero is not needed in ext4_unlink.
Remove unneeded code.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6050d47a 30-Oct-2014 Jan Kara <jack@suse.cz>

ext4: bail out from make_indexed_dir() on first error

When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node()
fail, there's really something wrong with the fs and there's no point in
continuing further. Just return error from make_indexed_dir() in that
case. Also initialize frames array so that if we return early due to
error, dx_release() doesn't try to dereference uninitialized memory
(which could happen also due to error in do_split()).

Coverity-id: 741300
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# cd808dec 23-Oct-2014 Miklos Szeredi <mszeredi@suse.cz>

ext4: support RENAME_WHITEOUT

Add whiteout support to ext4_rename(). A whiteout inode (chrdev/0,0) is
created before the rename takes place. The whiteout inode is added to the
old entry instead of deleting it.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 9aa5d32b 13-Oct-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: Replace open coded mdata csum feature to helper function

Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized csum machinery.

#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END

# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)

https://bugzilla.kernel.org/show_bug.cgi?id=82201

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# f4bb2981 05-Oct-2014 Theodore Ts'o <tytso@mit.edu>

ext4: add ext4_iget_normal() which is to be used for dir tree lookups

If there is a corrupted file system which has directory entries that
point at reserved, metadata inodes, prohibit them from being used by
treating them the same way we treat Boot Loader inodes --- that is,
mark them to be bad inodes. This prohibits them from being opened,
deleted, or modified via chmod, chown, utimes, etc.

In particular, this prevents a corrupted file system which has a
directory entry which points at the journal inode from being deleted
and its blocks released, after which point Much Hilarity Ensues.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# e2bfb088 05-Oct-2014 Theodore Ts'o <tytso@mit.edu>

ext4: don't orphan or truncate the boot loader inode

The boot loader inode (inode #5) should never be visible in the
directory hierarchy, but it's possible if the file system is corrupted
that there will be a directory entry that points at inode #5. In
order to avoid accidentally trashing it, when such a directory inode
is opened, the inode will be marked as a bad inode, so that it's not
possible to modify (or read) the inode from userspace.

Unfortunately, when we unlink this (invalid/illegal) directory entry,
we will put the bad inode on the ophan list, and then when try to
unlink the directory, we don't actually remove the bad inode from the
orphan list before freeing in-memory inode structure. This means the
in-memory orphan list is corrupted, leading to a kernel oops.

In addition, avoid truncating a bad inode in ext4_destroy_inode(),
since truncating the boot loader inode is not a smart thing to do.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# a9cfcd63 03-Sep-2014 Theodore Ts'o <tytso@mit.edu>

ext4: avoid trying to kfree an ERR_PTR pointer

Thanks to Dan Carpenter for extending smatch to find bugs like this.
(This was found using a development version of smatch.)

Fixes: 36de928641ee48b2078d3fe9514242aaa2f92013
Reported-by: Dan Carpenter <dan.carpenter@oracle.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 52c826db 29-Aug-2014 Wang Shilong <wshilong@ddn.com>

ext4: remove a duplicate call in ext4_init_new_dir()

ext4_journal_get_write_access() has just been called in ext4_append()
calling it again here is duplicated.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f8b3b59d 29-Aug-2014 Theodore Ts'o <tytso@mit.edu>

ext4: convert do_split() to use the ERR_PTR convention

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# dd73b5d5 29-Aug-2014 Theodore Ts'o <tytso@mit.edu>

ext4: convert dx_probe() to use the ERR_PTR convention

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1c215028 29-Aug-2014 Theodore Ts'o <tytso@mit.edu>

ext4: convert ext4_bread() to use the ERR_PTR convention

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 10560082 29-Aug-2014 Theodore Ts'o <tytso@mit.edu>

ext4: convert ext4_getblk() to use the ERR_PTR convention

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 537d8f93 29-Aug-2014 Theodore Ts'o <tytso@mit.edu>

ext4: convert ext4_dx_find_entry() to use the ERR_PTR convention

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d80d448c 27-Aug-2014 Darrick J. Wong <darrick.wong@oracle.com>

ext4: fix same-dir rename when inline data directory overflows

When performing a same-directory rename, it's possible that adding or
setting the new directory entry will cause the directory to overflow
the inline data area, which causes the directory to be converted to an
extent-based directory. Under this circumstance it is necessary to
re-read the directory when deleting the old dirent because the "old
directory" context still points to i_block in the inode table, which
is now an extent tree root! The delete fails with an FS error, and
the subsequent fsck complains about incorrect link counts and
hardlinked directories.

Test case (originally found with flat_dir_test in the metadata_csum
test program):

# mkfs.ext4 -O inline_data /dev/sda
# mount /dev/sda /mnt
# mkdir /mnt/x
# touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian
# sync
# for i in /mnt/x/*; do mv $i $i.longer; done
# ls -la /mnt/x/
total 0
-rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer

(Hey! Why are there four files now??)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 36de9286 23-Aug-2014 Theodore Ts'o <tytso@mit.edu>

ext4: propagate errors up to ext4_find_entry()'s callers

If we run into some kind of error, such as ENOMEM, while calling
ext4_getblk() or ext4_dx_find_entry(), we need to make sure this error
gets propagated up to ext4_find_entry() and then to its callers. This
way, transient errors such as ENOMEM can get propagated to the VFS.
This is important so that the system calls return the appropriate
error, and also so that in the case of ext4_lookup(), we return an
error instead of a NULL inode, since that will result in a negative
dentry cache entry that will stick around long past the OOM condition
which caused a transient ENOMEM error.

Google-Bug-Id: #17142205

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 7177a9c4 23-Jul-2014 Miklos Szeredi <mszeredi@suse.cz>

fs: call rename2 if exists

Christoph Hellwig suggests:

1) make vfs_rename call ->rename2 if it exists instead of ->rename
2) switch all filesystems that you're adding NOREPLACE support for to
use ->rename2
3) see how many ->rename instances we'll have left after a few
iterations of 2.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d745a8c2 26-May-2014 Jan Kara <jack@suse.cz>

ext4: reduce contention on s_orphan_lock

Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
we avoid taking global s_orphan_lock in some cases and hold it for
shorter time in other cases.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# cd2c080c 26-May-2014 Jan Kara <jack@suse.cz>

ext4: use sbi in ext4_orphan_{add|del}()

Use sbi pointer consistently in ext4_orphan_del() instead of opencoding
it sometimes. Also ext4_orphan_add() uses EXT4_SB(sb) often so create
sbi variable for it as well and use it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5d601255 12-May-2014 liang xie <xieliang007@gmail.com>

ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access

Make them more consistently

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 236f5ecb 21-Apr-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: remove obsoleted check

BH can not be NULL at this point, ext4_read_dirblock() always return
non null value, and we already have done all necessery checks.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bd42998a 01-Apr-2014 Miklos Szeredi <mszeredi@suse.cz>

ext4: add cross rename support

Implement RENAME_EXCHANGE flag in renameat2 syscall.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>


# bd1af145 01-Apr-2014 Miklos Szeredi <mszeredi@suse.cz>

ext4: rename: split out helper functions

Cross rename (exchange source and dest) will need to call some of these
helpers for both source and dest, while overwriting rename currently only
calls them for one or the other. This also makes the code easier to
follow.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>


# 0d7d5d67 01-Apr-2014 Miklos Szeredi <mszeredi@suse.cz>

ext4: rename: move EMLINK check up

Move checking i_nlink from after ext4_get_first_dir_block() to before. The
check doesn't rely on the result of that function and the function only
fails on fs corruption, so the order shouldn't matter.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>


# c0d268c3 01-Apr-2014 Miklos Szeredi <mszeredi@suse.cz>

ext4: rename: create ext4_renament structure for local vars

Need to split up ext4_rename() into helpers but there are too many local
variables involved, so create a new structure. This also, apparently,
makes the generated code size slightly smaller.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>


# 0a7c3937 01-Apr-2014 Miklos Szeredi <mszeredi@suse.cz>

vfs: add RENAME_NOREPLACE flag

If this flag is specified and the target of the rename exists then the
rename syscall fails with EEXIST.

The VFS does the existence checking, so it is trivial to enable for most
local filesystems. This patch only enables it in ext4.

For network filesystems the VFS check is not enough as there may be a race
between a remote create and the rename, so these filesystems need to handle
this flag in their ->rename() implementations to ensure atomicity.

Andy writes about why this is useful:

"The trivial answer: to eliminate the race condition from 'mv -i'.

Another answer: there's a common pattern to atomically create a file
with contents: open a temporary file, write to it, optionally fsync
it, close it, then link(2) it to the final name, then unlink the
temporary file.

The reason to use link(2) is because it won't silently clobber the destination.

This is annoying:
- It requires an extra system call that shouldn't be necessary.
- It doesn't work on (IMO sensible) filesystems that don't support
hard links (e.g. vfat).
- It's not atomic -- there's an intermediate state where both files exist.
- It's ugly.

The new rename flag will make this totally sensible.

To be fair, on new enough kernels, you can also use O_TMPFILE and
linkat to achieve the same thing even more cleanly."

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>


# 64e178a7 20-Dec-2013 Christoph Hellwig <hch@infradead.org>

ext2/3/4: use generic posix ACL infrastructure

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a34e15cc 06-Jan-2014 David Howells <dhowells@redhat.com>

ext4: use %pd printk specificer

Use the new %pd printk() specifier in Ext4 to replace passing of
dentry name or dentry name and name length * 2 with just passing the
dentry.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
cc: Andreas Dilger <adilger.kernel@dilger.ca>
cc: linux-ext4@vger.kernel.org


# 43ae9e3f 10-Oct-2013 Miklos Szeredi <mszeredi@suse.cz>

ext[34]: fix double put in tmpfile

d_tmpfile() already swallowed the inode ref.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0e202704 16-Aug-2013 Theodore Ts'o <tytso@mit.edu>

ext4: allocate delayed allocation blocks before rename

When ext4_rename() overwrites an already existing file, call
ext4_alloc_da_blocks() before starting the journal handle which
actually does the rename, instead of doing this afterwards. This
improves the likelihood that the contents will survive a crash if an
application replaces a file using the sequence:

1) write replacement contents to foo.new
2) <omit fsync of foo.new>
3) rename foo.new to foo

It is still not a guarantee, since ext4_alloc_da_blocks() is *not*
doing a file integrity sync; this means if foo.new is a very large
file, it may not be completely flushed out to disk.

However, for files smaller than a megabyte or so, any dirty pages
should be flushed out before we do the rename operation, and so at the
next journal commit, the CACHE FLUSH command will make sure al of
these pages are safely on the disk platter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5b61de75 16-Aug-2013 Theodore Ts'o <tytso@mit.edu>

ext4: start handle at least possible moment when renaming files

In ext4_rename(), don't start the journal handle until the the
directory entries have been successfully looked up.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e94bd349 20-Jul-2013 Zheng Liu <wenqing.lz@taobao.com>

ext4: fix a BUG when opening a file with O_TMPFILE flag

When we try to open a file with O_TMPFILE flag, we will trigger a bug.
The root cause is that in ext4_orphan_add() we check ->i_nlink == 0 and
this check always fails because we set ->i_nlink = 1 in
inode_init_always(). We can use the following program to trigger it:

int main(int argc, char *argv[])
{
int fd;

fd = open(argv[1], O_TMPFILE, 0666);
if (fd < 0) {
perror("open ");
return -1;
}
close(fd);
return 0;
}

The oops message looks like this:

kernel BUG at fs/ext4/namei.c:2572!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: dlci bridge stp hidp cmtp kernelcapi l2tp_ppp l2tp_netlink l2tp_core sctp libcrc32c rfcomm tun fuse nfnetli
nk can_raw ipt_ULOG can_bcm x25 scsi_transport_iscsi ipx p8023 p8022 appletalk phonet psnap vmw_vsock_vmci_transport af_key vmw_vmci rose vsock atm can netrom ax25 af_rxrpc ir
da pppoe pppox ppp_generic slhc bluetooth nfc rfkill rds caif_socket caif crc_ccitt af_802154 llc2 llc snd_hda_codec_realtek snd_hda_intel snd_hda_codec serio_raw snd_pcm pcsp
kr edac_core snd_page_alloc snd_timer snd soundcore r8169 mii sr_mod cdrom pata_atiixp radeon backlight drm_kms_helper ttm
CPU: 1 PID: 1812571 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #12
Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
task: ffff88007dfe69a0 ti: ffff88010f7b6000 task.ti: ffff88010f7b6000
RIP: 0010:[<ffffffff8125ce69>] [<ffffffff8125ce69>] ext4_orphan_add+0x299/0x2b0
RSP: 0018:ffff88010f7b7cf8 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8800966d3020 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88007dfe70b8 RDI: 0000000000000001
RBP: ffff88010f7b7d40 R08: ffff880126a3c4e0 R09: ffff88010f7b7ca0
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801271fd668
R13: ffff8800966d2f78 R14: ffff88011d7089f0 R15: ffff88007dfe69a0
FS: 00007f70441a3740(0000) GS:ffff88012a800000(0000) knlGS:00000000f77c96c0
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002834000 CR3: 0000000107964000 CR4: 00000000000007e0
DR0: 0000000000780000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
0000000000002000 00000020810b6dde 0000000000000000 ffff88011d46db00
ffff8800966d3020 ffff88011d7089f0 ffff88009c7f4c10 ffff88010f7b7f2c
ffff88007dfe69a0 ffff88010f7b7da8 ffffffff8125cfac ffff880100000004
Call Trace:
[<ffffffff8125cfac>] ext4_tmpfile+0x12c/0x180
[<ffffffff811cba78>] path_openat+0x238/0x700
[<ffffffff8100afc4>] ? native_sched_clock+0x24/0x80
[<ffffffff811cc647>] do_filp_open+0x47/0xa0
[<ffffffff811db73f>] ? __alloc_fd+0xaf/0x200
[<ffffffff811ba2e4>] do_sys_open+0x124/0x210
[<ffffffff81010725>] ? syscall_trace_enter+0x25/0x290
[<ffffffff811ba3ee>] SyS_open+0x1e/0x20
[<ffffffff816ca8d4>] tracesys+0xdd/0xe2
[<ffffffff81001001>] ? start_thread_common.constprop.6+0x1/0xa0
Code: 04 00 00 00 89 04 24 31 c0 e8 c4 77 04 00 e9 43 fe ff ff 66 25 00 d0 66 3d 00 80 0f 84 0e fe ff ff 83 7b 48 00 0f 84 04 fe ff ff <0f> 0b 49 8b 8c 24 50 07 00 00 e9 88 fe ff ff 0f 1f 84 00 00 00

Here we couldn't call clear_nlink() directly because in d_tmpfile() we
will call inode_dec_link_count() to decrease ->i_nlink. So this commit
tries to call d_tmpfile() before ext4_orphan_add() to fix this problem.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>


# af51a2ac 29-Jun-2013 Al Viro <viro@zeniv.linux.org.uk>

ext4: ->tmpfile() support

very similar to ext3 counterpart...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 64cb9273 01-Jul-2013 Al Viro <viro@ZenIV.linux.org.uk>

ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()

Both ext3 and ext4 htree_dirblock_to_tree() is just filling the
in-core rbtree for use by call_filldir(). All updates of ->f_pos are
done by the latter; bumping it here (on error) is obviously wrong - we
might very well have it nowhere near the block we'd found an error in.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 8af0f082 19-Apr-2013 Tao Ma <boyu.mt@taobao.com>

ext4: fix readdir error in the case of inline_data+dir_index

Zach reported a problem that if inline data is enabled, we don't
tell the difference between the offset of '.' and '..'. And a
getdents will fail if the user only want to get '.' and what's worse,
if there is a conversion happens when the user calls getdents
many times, he/she may get the same entry twice.

In theory, a dir block would also fail if it is converted to a
hashed-index based dir since f_pos will become a hash value, not the
real one, but it doesn't happen. And a deep investigation shows that
we uses a hash based solution even for a normal dir if the dir_index
feature is enabled.

So this patch just adds a new htree_inlinedir_to_tree for inline dir,
and if we find that the hash index is supported, we will do like what
we do for a dir block.

Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# eb9cc7e1 19-Apr-2013 Jan Kara <jack@suse.cz>

ext4: move quota initialization out of inode allocation transaction

Inode allocation transaction is pretty heavy (246 credits with quotas
and extents before previous patch, still around 200 after it). This is
mostly due to credits required for allocation of quota structures
(credits there are heavily overestimated but it's difficult to make
better estimates if we don't want to wire non-trivial assumptions about
quota format into filesystem).

So move quota initialization out of allocation transaction. That way
transaction for quota structure allocation will be started only if we
need to look up quota structure on disk (rare) and furthermore it will
be started for each quota type separately, not for all of them at once.
This reduces maximum transaction size to 34 is most cases and to 73 in
the worst case.

[ Modified by tytso to clean up the cleanup paths for error handling.
Also use a separate call to ext4_std_error() for each failure so it
is easier for someone who is debugging a problem in this function to
determine which function call failed. ]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# d6a77105 09-Apr-2013 Theodore Ts'o <tytso@mit.edu>

ext4: fix miscellaneous big endian warnings

None of these result in any bug, but they makes sparse complain.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0f70b406 15-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: use ERR_PTR() abstraction for ext4_append()

Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate
pointer to an integer for the error code, as a code cleanup.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# dc6982ff 14-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: refactor code to read directory blocks into ext4_read_dirblock()

The code to read in directory blocks and verify their metadata
checksums was replicated in ten different places across
fs/ext4/namei.c, and the code was buggy in subtle ways in a number of
those replicated sites. In some cases, ext4_error() was called with a
training newline. In others, in particularly in empty_dir(), it was
possible to call ext4_dirent_csum_verify() on an index block, which
would trigger false warnings requesting the system adminsitrator to
run e2fsck.

By refactoring the code, we make the code more readable, as well as
shrinking the compiled object file by over 700 bytes and 50 lines of
code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1139575a 09-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: start handle at the last possible moment when creating inodes

In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
until the inode has been succesfully allocated. In order to do this,
we need to start the handle in the ext4_new_inode(). So create a new
variant of this function, ext4_new_inode_start_handle(), so the handle
can be created at the last possible minute, before we need to modify
the inode allocation bitmap block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 64044abf 09-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir()

The ext4_unlink() and ext4_rmdir() don't actually release the blocks
associated with the file/directory. This gets done in a separate jbd2
handle called via ext4_evict_inode(). Thus, we don't need to reserve
lots of journal credits for the truncate.

Note that using too many journal credits is non-optimal because it can
leading to the journal transmit getting closed too early, before it is
strictly necessary.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 8dcfaad2 09-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: start handle at the last possible moment in ext4_rmdir()

Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 931b6864 09-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: start handle at the last possible moment in ext4_unlink()

Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 9924a92a 08-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: pass context information to jbd2__journal_start()

So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

T=/sys/kernel/debug/tracing
EVENT=$T/events/jbd2/jbd2_handle_stats
echo "interval > 5" > $EVENT/filter
echo 1 > $EVENT/enable

./run-my-fs-benchmark

cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms. Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation. Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b1deefc9 28-Jan-2013 Guo Chao <yan@linux.vnet.ibm.com>

ext4: remove unnecessary NULL pointer check

brelse() and ext4_journal_force_commit() are both inlined and able
to handle NULL.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 41be871f 28-Jan-2013 Guo Chao <yan@linux.vnet.ibm.com>

ext4: remove useless assignment in dx_probe()

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2bbbee2a 28-Jan-2013 Guo Chao <yan@linux.vnet.ibm.com>

ext4: remove unused variable in add_dirent_to_buf()

After commit 978fef9 (create __ext4_insert_dentry for dir entry
insertion), 'reclen' is not used anymore.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# d5ac7773 28-Jan-2013 Guo Chao <yan@linux.vnet.ibm.com>

ext4: release buffer when checksum failed

Commit b0336e8d (ext4: calculate and verify checksums of directory
leaf blocks) and commit dbe89444 (ext4: Calculate and verify checksums
for htree nodes) forget to release buffer when checksum failed, at
some places.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# 306a7492 18-Dec-2012 Guo Chao <yan@linux.vnet.ibm.com>

ext3, ext4, ocfs2: remove unused macro NAMEI_RA_INDEX

This macro, initially introduced by ext2 in v0.99.15, does not
have any users from the beginning. It has been removed in later
ext2 version but still remains in the code of ext3, ext4, ocfs2.
Remove this macro there.

Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: ocfs2-devel@oss.oracle.com
Acked-by: Mark Fasheh <mfasheh@suse.de>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>


# fef0ebdb 06-Jan-2013 Guo Chao <yan@linux.vnet.ibm.com>

ext4: remove duplicate call to ext4_bread() in ext4_init_new_dir()

This fixes a buffer cache leak when creating a directory, introduced
in commit a774f9c20.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>


# 0ecaef06 06-Jan-2013 Guo Chao <yan@linux.vnet.ibm.com>

ext4: release buffer in failed path in dx_probe()

If checksum fails, we should also release the buffer
read from previous iteration.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>-
Cc: stable@vger.kernel.org
--
fs/ext4/namei.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)


# 0e9a9a1a 26-Dec-2012 Theodore Ts'o <tytso@mit.edu>

ext4: avoid hang when mounting non-journal filesystems with orphan list

When trying to mount a file system which does not contain a journal,
but which does have a orphan list containing an inode which needs to
be truncated, the mount call with hang forever in
ext4_orphan_cleanup() because ext4_orphan_del() will return
immediately without removing the inode from the orphan list, leading
to an uninterruptible loop in kernel code which will busy out one of
the CPU's on the system.

This can be trivially reproduced by trying to mount the file system
found in tests/f_orphan_extents_inode/image.gz from the e2fsprogs
source tree. If a malicious user were to put this on a USB stick, and
mount it on a Linux desktop which has automatic mounts enabled, this
could be considered a potential denial of service attack. (Not a big
deal in practice, but professional paranoids worry about such things,
and have even been known to allocate CVE numbers for such problems.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org


# 939da108 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: Remove CONFIG_EXT4_FS_XATTR

Ted has sent out a RFC about removing this feature. Eric and Jan
confirmed that both RedHat and SUSE enable this feature in all their
product. David also said that "As far as I know, it's enabled in all
Android kernels that use ext4." So it seems OK for us.

And what's more, as inline data depends its implementation on xattr,
and to be frank, I don't run any test again inline data enabled while
xattr disabled. So I think we should add inline data and remove this
config option in the same release.

[ The savings if you disable CONFIG_EXT4_FS_XATTR is only 27k, which
isn't much in the grand scheme of things. Since no one seems to be
testing this configuration except for some automated compile farms, on
balance we are better removing this config option, and so that it is
effectively always enabled. -- tytso ]

Cc: David Brown <davidb@codeaurora.org>
Cc: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 32f7f22c 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_rename handle inline dir

In case we rename a directory, ext4_rename has to read the dir block
and change its dotdot's information. The old ext4_rename encapsulated
the dir_block read into itself. So this patch adds a new function
ext4_get_first_dir_block() which gets the dir buffer information so
the ext4_rename can handle it properly. As it will also change the
parent inode number, we return the parent_de so that ext4_rename() can
handle it more easily.

ext4_find_entry is also changed so that the caller(rename) can tell
whether the found entry is an inlined one or not and journaling the
corresponding buffer head.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 61f86638 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let empty_dir handle inline dir

empty_dir is used when deleting a dir. So it should handle inline dir
properly.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9f40fe54 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_delete_entry() handle inline data

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 05019a9e 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: make ext4_delete_entry generic

Currently ext4_delete_entry() is used only for dir entry removing from
a dir block. So let us create a new function
ext4_generic_delete_entry and this function takes a entry_buf and a
buf_size so that it can be used for inline data.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e8e948e7 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_find_entry handle inline data

Create a new function ext4_find_inline_entry() to handle the case of
inline data.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7335cd3b 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: create a new function search_dir

search_dirblock is used to search a dir block, but the code is almost
the same for searching an inline dir.

So create a new fuction search_dir and let search_dirblock call it.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3c47d541 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let add_dir_entry handle inline data properly

This patch let add_dir_entry handle the inline data case. So the
dir is initialized as inline dir first and then we can try to add
some files to it, when the inline space can't hold all the entries,
a dir block will be created and the dir entry will be moved to it.

Also for an inlined dir, "." and ".." are removed and we only use
4 bytes to store the parent inode number. These 2 entries will be
added when we convert an inline dir to a block-based one.

[ Folded in patch from Dan Carpenter to remove an unused variable. ]

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 978fef91 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: create __ext4_insert_dentry for dir entry insertion

The old add_dirent_to_buf handles all the work related to the
work of adding dir entry to a dir block. Now we have inline data,
so create 2 new function __ext4_find_dest_de and __ext4_insert_dentry
that do the real work and let add_dirent_to_buf call them.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 226ba972 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: refactor __ext4_check_dir_entry() to accept start and size

The __ext4_check_dir_entry() function() is used to check whether the
de is over the block boundary. Now with inline data, it could be
within the block boundary while exceeds the inode size. So check this
function to check the overflow more precisely.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a774f9c2 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: make ext4_init_dot_dotdot for inline dir usage

Currently, the initialization of dot and dotdot are encapsulated in
ext4_mkdir and also bond with dir_block. So create a new function
named ext4_init_new_dir and the initialization is moved to
ext4_init_dot_dotdot. Now it will called either in the normal non-inline
case(rec_len of ".." will cover the whole block) or when we converting an
inline dir to a block(rec len of ".." will be the real length). The start
of the next entry is also returned for inline dir usage.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c6af8803 12-Nov-2012 Darrick J. Wong <darrick.wong@oracle.com>

ext4: don't verify checksums of dx non-leaf nodes during fallback scan

During a directory entry lookup of a hashed directory, if the
hash-based lookup functions fail and we fall back to a linear scan,
don't try to verify the dirent checksum on the internal nodes of the
hash tree because they don't store a checksum in a hidden dirent like
the leaf nodes do.

Reported-by: George Spelvin <linux@horizon.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# dffe9d8d 10-Nov-2012 Theodore Ts'o <tytso@mit.edu>

ext4: do not use ext4_error() when there is no space in dir leaf for csum

If there is no space for a checksum in a directory leaf node,
previously we would use EXT4_ERROR_INODE() which would mark the file
system as inconsistent. While it would be nice to use e2fsck -D, it
certainly isn't required, so just print a warning using
ext4_warning().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>


# 6d1ab10e 27-Sep-2012 Carlos Maiolino <cmaiolino@redhat.com>

ext4: ext4_bread usage audit

When ext4_bread() returns NULL and err is set to zero, this means
there is no phyical block mapped to the specified logical block
number. (Previous to commit 90b0a97323, err was uninitialized in this
case, which caused other problems.)

The directory handling routines use ext4_bread() in many places, the
fact that ext4_bread() now returns NULL with err set to zero could
cause problems since a number of these functions will simply return
the value of err if the result of ext4_bread() was the NULL pointer,
causing the caller of the function to think that the function was
successful.

Since directories should never contain holes, this case can only
happen if the file system is corrupted. This commit audits all of the
callers of ext4_bread(), and makes sure they do the right thing if a
hole in a directory is found by ext4_bread().

Some ext4_bread() callers did not need any changes either because they
already had its own hole detector paths.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6a08f447 26-Sep-2012 Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>

ext4: always set i_op in ext4_mknod()

ext4_special_inode_operations have their own ifdef CONFIG_EXT4_FS_XATTR
to mask those methods. And ext4_iget also always sets it, so there is
an inconsistency.

Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org


# c9b92530 18-Sep-2012 Anatol Pomozov <anatol.pomozov@gmail.com>

ext4: make orphan functions be no-op in no-journal mode

Instead of checking whether the handle is valid, we check if journal
is enabled. This avoids taking the s_orphan_lock mutex in all cases
when there is no journal in use, including the error paths where
ext4_orphan_del() is called with a handle set to NULL.

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 90b0a973 17-Sep-2012 Carlos Maiolino <cmaiolino@redhat.com>

ext4: fix possible non-initialized variable in htree_dirblock_to_tree()

htree_dirblock_to_tree() declares a non-initialized 'err' variable,
which is passed as a reference to another functions expecting them to
set this variable with their error codes.

It's passed to ext4_bread(), which then passes it to ext4_getblk(). If
ext4_map_blocks() returns 0 due to a lookup failure, leaving the
ext4_getblk() buffer_head uninitialized, it will make ext4_getblk()
return to ext4_bread() without initialize the 'err' variable, and
ext4_bread() will return to htree_dirblock_to_tree() with this variable
still uninitialized. htree_dirblock_to_tree() will pass this variable
with garbage back to ext4_htree_fill_tree(), which expects a number of
directory entries added to the rb-tree. which, in case, might return a
fake non-zero value due the garbage left in the 'err' variable, leading
the kernel to an Oops in ext4_dx_readdir(), once this is expecting a
filled rb-tree node, when in turn it will have a NULL-ed one, causing an
invalid page request when trying to get a fname struct from this NULL-ed
rb-tree node in this line:

fname = rb_entry(info->curr_node, struct fname, rb_hash);

The patch itself initializes the err variable in
htree_dirblock_to_tree() to avoid usage mistakes by the called
functions, and also fix ext4_getblk() to return a initialized 'err'
variable when ext4_map_blocks() fails a lookup.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# df981d03 17-Aug-2012 Theodore Ts'o <tytso@mit.edu>

ext4: add max_dir_size_kb mount option

Very large directories can cause significant performance problems, or
perhaps even invoke the OOM killer, if the process is running in a
highly constrained memory environment (whether it is VM's with a small
amount of memory or in a small memory cgroup).

So it is useful, in cloud server/data center environments, to be able
to set a filesystem-wide cap on the maximum size of a directory, to
ensure that directories never get larger than a sane size. We do this
via a new mount option, max_dir_size_kb. If there is an attempt to
grow the directory larger than max_dir_size_kb, the system call will
return ENOSPC instead.

Google-Bug-Id: 6863013

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b50924c2 22-Jul-2012 Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

ext4: remove unnecessary argument from __ext4_handle_dirty_metadata()

The '__ext4_handle_dirty_metadata()' does not need the 'now' argument
anymore and we can kill it.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 8fc37ec5 18-Jul-2012 Al Viro <viro@zeniv.linux.org.uk>

don't expose I_NEW inodes via dentry->d_inode

d_instantiate(dentry, inode);
unlock_new_inode(inode);

is a bad idea; do it the other way round...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ebfc3b49 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

don't pass nameidata to ->create()

boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 00cd8dd3 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

stop passing nameidata to ->lookup()

Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ef58f69c 09-Jul-2012 Tao Ma <boyu.mt@taobao.com>

ext4: use proper csum calculation in ext4_rename

In ext4_rename, when the old name is a dir, we need to
change ".." to its new parent and journal the change, so
with metadata_csum enabled, we have to re-calc the csum.

As the first block of the dir can be either a htree root
or a normal directory block and we have different csum
calculation for these 2 types, we have to choose the right
one in ext4_rename.

btw, it is found by xfstests 013.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@us.ibm.com>


# 7e936b73 28-May-2012 Andreas Dilger <adilger@dilger.ca>

ext4: disallow hard-linked directory in ext4_lookup

A hard-linked directory to its parent can cause the VFS to deadlock,
and is a sign of a corrupted file system. So detect this case in
ext4_lookup(), before the rmdir() lockup scenario can take place.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 26fe5750 10-May-2012 Linus Torvalds <torvalds@linux-foundation.org>

vfs: make it possible to access the dentry hash/len as one 64-bit entry

This allows comparing hash and len in one operation on 64-bit
architectures. Right now only __d_lookup_rcu() takes advantage of this,
since that is the case we care most about.

The use of anonymous struct/unions hides the alternate 64-bit approach
from most users, the exception being a few cases where we initialize a
'struct qstr' with a static initializer. This makes the problematic
cases use a new QSTR_INIT() helper function for that (but initializing
just the name pointer with a "{ .name = xyzzy }" initializer remains
valid, as does just copying another qstr structure).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b09de7fa 30-Apr-2012 Theodore Ts'o <tytso@mit.edu>

ext4: remove unnecessary check in add_dirent_to_buf()

None of this function callers ever pass in a NULL inode pointer, so
this check is unnecessary, and the else clause is dead code. (This
change should make the code coverage people a little happier. :-)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b0336e8d 29-Apr-2012 Darrick J. Wong <djwong@us.ibm.com>

ext4: calculate and verify checksums of directory leaf blocks

Calculate and verify the checksums for directory leaf blocks
(i.e. blocks that only contain actual directory entries). The
checksum lives in what looks to be an unused directory entry with a 0
name_len at the end of the block. This scheme is not used for
internal htree nodes because the mechanism in place there only costs
one dx_entry, whereas the "empty" directory entry would cost two
dx_entries.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# dbe89444 29-Apr-2012 Darrick J. Wong <djwong@us.ibm.com>

ext4: Calculate and verify checksums for htree nodes

Calculate and verify the checksum for directory index tree (htree)
node blocks. The checksum is stored in the last 4 bytes of the htree
block and requires the dx_entry array to stop 1 dx_entry short of the
end of the block.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a9c47317 29-Apr-2012 Darrick J. Wong <djwong@us.ibm.com>

ext4: calculate and verify superblock checksum

Calculate and verify the superblock checksum. Since the UUID and
block group number are embedded in each copy of the superblock, we
need only checksum the entire block. Refactor some of the code to
eliminate open-coding of the checksum update call.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e6153918 29-Apr-2012 Darrick J. Wong <djwong@us.ibm.com>

ext4: change on-disk layout to support extended metadata checksumming

Define flags and change structure definitions to allow checksumming of
ext4 metadata.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9ee49302 20-Feb-2012 Zheng Liu <gnehzuil.liu@gmail.com>

ext4: format flag in dx_probe()

Fix ext4_warning format flag in dx_probe().

CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0ce8c010 08-Jan-2012 Al Viro <viro@zeniv.linux.org.uk>

ext[34]: avoid i_nlink warnings triggered by drop_nlink/inc_nlink kludge in symlink()

Both ext3 and ext4 put the half-created symlink inode into the orphan list
for a while (see the comment in ext[34]_symlink() for gory details). Then,
if everything went fine, they pull it out of the orphan list and bump the
link count back to 1. The thing is, inc_nlink() is going to complain about
seeing somebody changing i_nlink from 0 to 1. With a good reason, since
normally something like that is a bug. Explicit set_nlink(inode, 1) does
the same thing as inc_nlink() here, but it does *not* complain - exactly
because it should be usable in strange situations like this one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1a67aafb 25-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch ->mknod() to umode_t

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4acdaf27 25-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch ->create() to umode_t

vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 18bb1db3 25-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch vfs_mkdir() and ->mkdir() to umode_t

vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# bfe86848 28-Oct-2011 Miklos Szeredi <mszeredi@suse.cz>

filesystems: add set_nlink()

Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 6d6b77f1 28-Oct-2011 Miklos Szeredi <mszeredi@suse.cz>

filesystems: add missing nlink wrappers

Replace direct i_nlink updates with the respective updater function
(inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 5cb81dab 29-Oct-2011 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: fix quota accounting during migration

The tmp_inode should have same uid/gid as the original inode.
Otherwise new metadata blocks will be accounted to wrong quota-id,
which will result in a quota leak after the inode migration is
completed.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 909a4cf1 26-Oct-2011 Andreas Dilger <adilger@whamcloud.com>

ext4: avoid setting directory i_nlink to zero

If a directory with more than EXT4_LINK_MAX subdirectories, the nlink
count is set to 1. Subsequently, if any subdirectories are deleted,
ext4_dec_count() decrements the i_nlink count, which may go to 0
temporarily before being incremented back to 1.

While this is done under i_mutex, which prevents races for directory
and inode operations that check i_nlink, the temporary i_nlink == 0
case is exposed to userspace via stat() and similar calls that do not
hold i_mutex.

Instead, change the code to not decrement i_nlink count for any
directories that do not already have i_nlink larger than 2.

Reported-by: Cliff White <cliffw@whamcloud.com>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5930ea64 30-Aug-2011 Theodore Ts'o <tytso@mit.edu>

ext4: call ext4_handle_dirty_metadata with correct inode in ext4_dx_add_entry

ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads
that point to directory blocks assigned to the directory inode. However, the
function calls ext4_handle_dirty_metadata with the inode of the file that's
being added to the directory, not the directory inode itself. Therefore,
correct the code to dirty the directory buffers with the directory inode, not
the file inode.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# f9287c1f 30-Aug-2011 Darrick J. Wong <djwong@us.ibm.com>

ext4: ext4_mkdir should dirty dir_block with newly created directory inode

ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir".
Unfortunately, dir_block belongs to the newly created directory (which is
"inode"), not the parent directory (which is "dir"). Fix the incorrect
association.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# bcaa9929 31-Aug-2011 Darrick J. Wong <djwong@us.ibm.com>

ext4: ext4_rename should dirty dir_bh with the correct directory

When ext4_rename performs a directory rename (move), dir_bh is a
buffer that is modified to update the '..' link in the directory being
moved (old_inode). However, ext4_handle_dirty_metadata is called with
the old parent directory inode (old_dir) and dir_bh, which is
incorrect because dir_bh does not belong to the parent inode. Fix
this error.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 65299a3b 23-Aug-2011 Christoph Hellwig <hch@infradead.org>

block: separate priority boosting from REQ_META

Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule,
and lave REQ_META purely for marking requests as metadata in blktrace.

All existing callers of REQ_META except for XFS are updated to also
set REQ_PRIO for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 5dc06c5a 23-Aug-2011 Christoph Hellwig <hch@infradead.org>

block: remove READ_META and WRITE_META

Replace all occurnanced of the undocumented READ_META with READ | REQ_META
and remove the unused WRITE_META define.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 8c208719 11-Aug-2011 Eric Sandeen <sandeen@redhat.com>

ext4: Properly count journal credits for long symlinks

Commit df5e6223407e ("ext4: fix deadlock in ext4_symlink() in ENOSPC
conditions") recalculated the number of credits needed for a long
symlink, in the process of splitting it into two transactions. However,
the first credit calculation under-counted because if selinux is
enabled, credits are needed to create the selinux xattr as well.

Overrunning the reservation will result in an OOPS in
jbd2_journal_dirty_metadata() due to this assert:

J_ASSERT_JH(jh, handle->h_buffer_credits > 0);

Fix this by increasing the reservation size.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e34e719 23-Jul-2011 Christoph Hellwig <hch@lst.de>

fs: take the ACL checks to common code

Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a9049376 08-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)

... and simplify the living hell out of callers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 265c6a0f 16-Jul-2011 Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>

ext4: fix compilation with -DDX_DEBUG

Compilation of ext4/namei.c brought up an error and warning messages
when compiled with -DDX_DEBUG

Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# afb86178 11-Jul-2011 Lukas Czerner <lczerner@redhat.com>

ext4: remove unnecessary comments in ext4_orphan_add()

The comment from Al Viro about possible race in the ext4_orphan_add() is
not justified. There is no race possible as we always have either i_mutex
locked, or the inode can not be referenced from outside hence the
J_ASSERS should not be hit from the reason described in comment.

This commit replaces it with notion that we are holding i_mutex so it
should not be possible for i_nlink to be changed while waiting for
s_orphan_lock.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 40ebc0af 24-May-2011 Sage Weil <sage@newdream.net>

ext4: remove unnecessary dentry_unhash on rmdir/rename_dir

ext4 has no problems with lingering references to unlinked directory
inodes.

CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Andreas Dilger <adilger.kernel@dilger.ca>
CC: linux-ext4@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e4eaac06 24-May-2011 Sage Weil <sage@newdream.net>

vfs: push dentry_unhash on rename_dir into file systems

Only a few file systems need this. Start by pushing it down into each
rename method (except gfs2 and xfs) so that it can be dealt with on a
per-fs basis.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 79bf7c73 24-May-2011 Sage Weil <sage@newdream.net>

vfs: push dentry_unhash on rmdir into file systems

Only a few file systems need this. Start by pushing it down into each
fs rmdir method (except gfs2 and xfs) so it can be dealt with on a per-fs
basis.

This does not change behavior for any in-tree file systems.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6976a6f2 14-May-2011 Allison Henderson <achender@linux.vnet.ibm.com>

ext4: don't dereference null pointer when make_indexed_dir() fails

Fix for a null pointer bug found while running punch hole tests

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# df5e6223 03-May-2011 Jan Kara <jack@suse.cz>

ext4: fix deadlock in ext4_symlink() in ENOSPC conditions

ext4_symlink() cannot call __page_symlink() with transaction open.
__page_symlink() calls ext4_write_begin() which can wait for
transaction commit if we are running out of space thus causing a
deadlock. Also error recovery in ext4_truncate_failed_write() does not
count with the transaction being already started (although I'm not
aware of any particular deadlock here).

Fix the problem by stopping a transaction before calling
__page_symlink() (we have to be careful and put inode to orphan list
so that it gets deleted in case of crash) and starting another one
after __page_symlink() returns for addition of symlink into a
directory.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7ad8e4e6 03-May-2011 Jan Kara <jack@suse.cz>

ext4: Fix fs corruption when make_indexed_dir() fails

When make_indexed_dir() fails (e.g. because of ENOSPC) after it has
allocated block for index tree root, we did not properly mark all
changed buffers dirty. This lead to only some of these buffers being
written out and thus effectively corrupting the directory.

Fix the issue by marking all changed data dirty even in the error
failure case.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0562e0ba 21-Mar-2011 Jiaying Zhang <jiayingz@google.com>

ext4: add more tracepoints and use dev_t in the trace buffer

- Add more ext4 tracepoints.
- Change ext4 tracepoints to use dev_t field with MAJOR/MINOR macros
so that we can save 4 bytes in the ring buffer on some platforms.
- Add sync_mode to ext4_da_writepages, ext4_da_write_pages, and
ext4_da_writepages_result tracepoints. Also remove for_reclaim
field from ext4_da_writepages since it is usually not very useful.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ef607893 20-Mar-2011 Amir Goldstein <amir73il@gmail.com>

ext4: handle errors in ext4_rename

Checking return code from ext4_journal_get_write_access() is important
with snapshots, because this function invokes COW, so may return new
errors, such as ENOSPC.

We move the call to ext4_journal_get_write_access earlier in the
function, to simplify error handling in the case that this function
returns returns an error.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f17b6042 29-Jan-2011 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

fs: Remove i_nlink check from file system link callback

Now that VFS check for inode->i_nlink == 0 and returns proper
error, remove similar check from file system

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b4097142 09-Jan-2011 Theodore Ts'o <tytso@mit.edu>

ext4: add error checking to calls to ext4_handle_dirty_metadata()

Call ext4_std_error() in various places when we can't bail out
cleanly, so the file system can be marked as in error.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# dabd991f 09-Jan-2011 Namhyung Kim <namhyung@gmail.com>

ext4: add more error checks to ext4_mkdir()

Check return value of ext4_journal_get_write_access,
ext4_journal_dirty_metadata and ext4_mark_inode_dirty. Move brelse()
under 'out_stop' to release bh properly in case of journal error.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f7c21177 09-Jan-2011 Theodore Ts'o <tytso@mit.edu>

ext4: Use ext4_error_file() to print the pathname to the corrupted inode

Where the file pointer is available, use ext4_error_file() instead of
ext4_error_inode().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# cad3f007 19-Dec-2010 Theodore Ts'o <tytso@mit.edu>

ext4: optimize ext4_check_dir_entry() with unlikely() annotations

This function gets called a lot for large directories, and the answer
is almost always "no, no, there's no problem". This means using
unlikely() is a good thing.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6ca7b13d 19-Dec-2010 Tobias Klauser <tklauser@distanz.ch>

ext4: Remove redundant unlikely()

IS_ERR() already implies unlikely(), so it can be omitted here.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6d5c3aa8 14-Dec-2010 Aaro Koskinen <aaro.koskinen@nokia.com>

ext4: fix typo which broke '..' detection in ext4_find_entry()

There should be a check for the NUL character instead of '0'.

Fortunately the only thing that cares about this is NFS serving, which
is why we didn't notice this in the merge window testing.

Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7845c049 27-Oct-2010 Theodore Ts'o <tytso@mit.edu>

ext4: use search_dirblock() in ext4_dx_find_entry()

Use the search_dirblock() in ext4_dx_find_entry(). It makes the code
easier to read, and it takes advantage of common code. It also saves
100 bytes or so of text space.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>


# 8941ec8b 27-Oct-2010 Theodore Ts'o <tytso@mit.edu>

ext4: avoid uninitialized memory references in ext3_htree_next_block()

If the first block of htree directory is missing '.' or '..' but is
otherwise a valid directory, and we do a lookup for '.' or '..', it's
possible to dereference an uninitialized memory pointer in
ext4_htree_next_block().

We avoid this by moving the special case from ext4_dx_find_entry() to
ext4_find_entry(); this also means we can optimize ext4_find_entry()
slightly when NFS looks up "..".

Thanks to Brad Spengler for pointing a Clang warning that led me to
look more closely at this code. The warning was harmless, but it was
useful in pointing out code that was too ugly to live. This warning was
also reported by Roman Borisov.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>


# 7de9c6ee 23-Oct-2010 Al Viro <viro@zeniv.linux.org.uk>

new helper: ihold()

Clones an existing reference to inode; caller must already hold one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0cfc9255 04-Aug-2010 Eric Sandeen <sandeen@redhat.com>

ext4: re-inline ext4_rec_len_(to|from)_disk functions

commit 3d0518f4, "ext4: New rec_len encoding for very
large blocksizes" made several changes to this path, but from
a perf perspective, un-inlining ext4_rec_len_from_disk() seems
most significant. This function is called from ext4_check_dir_entry(),
which on a file-creation workload is called extremely often.

I tested this with bonnie:

# bonnie++ -u root -s 0 -f -x 200 -d /mnt/test -n 32

(this does 200 iterations) and got this for the file creations:

ext4 stock: Average = 21206.8 files/s
ext4 inlined: Average = 22346.7 files/s (+5%)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 60fd4da3 27-Jul-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Cleanup ext4_check_dir_entry so __func__ is now implicit

Also start passing the line number to ext4_check_dir since we're going
to need it in upcoming patch.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5a0790c2 14-Jun-2010 Andi Kleen <andi@firstfloor.org>

ext4: remove initialized but not read variables

No real bugs found, just removed some dead code.

Found by gcc 4.6's new warnings.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 14ece102 17-May-2010 Frank Mayhar <fmayhar@google.com>

ext4: Make fsync sync new parent directories in no-journal mode

Add a new ext4 state to tell us when a file has been newly created; use
that state in ext4_sync_file in no-journal mode to tell us when we need
to sync the parent directory as well as the inode and data itself. This
fixes a problem in which a panic or power failure may lose the entire
file even when using fsync, since the parent directory entry is lost.

Addresses-Google-Bug: #2480057

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 60e6679e 17-May-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Drop whitespace at end of lines

This patch was generated using:

#!/usr/bin/perl -i
while (<>) {
s/[ ]+$//;
print;
}

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 12e9b892 16-May-2010 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: Use bitops to read/modify i_flags in struct ext4_inode_info

At several places we modify EXT4_I(inode)->i_flags without holding
i_mutex (ext4_do_update_inode, ...). These modifications are racy and
we can lose updates to i_flags. So convert handling of i_flags to use
bitops which are atomic.

https://bugzilla.kernel.org/show_bug.cgi?id=15792

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 24676da4 16-May-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE()

EXT4_ERROR_INODE() tends to provide better error information and in a
more consistent format. Some errors were not even identifying the inode
or directory which was corrupted, which made them not very useful.

Addresses-Google-Bug: #2507977

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 871a2931 03-Mar-2010 Christoph Hellwig <hch@infradead.org>

dquot: cleanup dquot initialize routine

Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 907f4554 03-Mar-2010 Christoph Hellwig <hch@infradead.org>

dquot: move dquot initialization responsibility into the filesystem

Currently various places in the VFS call vfs_dq_init directly. This means
we tie the quota code into the VFS. Get rid of that and make the
filesystem responsible for the initialization. For most metadata operations
this is a straight forward move into the methods, but for truncate and
open it's a bit more complicated.

For truncate we currently only call vfs_dq_init for the sys_truncate case
because open already takes care of it for ftruncate and open(O_TRUNC) - the
new code causes an additional vfs_dq_init for those which is harmless.

For open the initialization is moved from do_filp_open into the open method,
which means it happens slightly earlier now, and only for regular files.
The latter is fine because we don't need to initialize it for operations
on special files, and we already do it as part of the namespace operations
for directories.

Add a dquot_file_open helper that filesystems that support generic quotas
can use to fill in ->open.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 6e3617e5 01-Mar-2010 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: Handle non empty on-disk orphan link

In case of truncate errors we explicitly remove inode from in-core
orphan list via orphan_del(NULL, inode) without modifying the on-disk list.

But later on, the same inode may be inserted in the orphan list again
which will result the on-disk linked list getting corrupted. If inode
i_dtime contains valid value, then skip on-disk list modification.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 73b50c1c 16-Feb-2010 Curt Wohlgemuth <curtw@google.com>

ext4: Fix BUG_ON at fs/buffer.c:652 in no journal mode

Calls to ext4_handle_dirty_metadata should only pass in an inode
pointer for inode-specific metadata, and not for shared metadata
blocks such as inode table blocks, block group descriptors, the
superblock, etc.

The BUG_ON can get tripped when updating a special device (such as a
block device) that is opened (so that i_mapping is set in
fs/block_dev.c) and the file system is mounted in no journal mode.

Addresses-Google-Bug: #2404870

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 12062ddd 15-Feb-2010 Eric Sandeen <sandeen@redhat.com>

ext4: move __func__ into a macro for ext4_warning, ext4_error

Just a pet peeve of mine; we had a mishash of calls with either __func__
or "function_name" and the latter tends to get out of sync.

I think it's easier to just hide the __func__ in a macro, and it'll
be consistent from then on.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5aca07eb 08-Dec-2009 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: quota macros cleanup

Currently all quota block reservation macros contains hard-coded "2"
aka MAXQUOTAS value. This is no good because in some places it is not
obvious to understand what does this digit represent. Let's introduce
new macro with self descriptive name.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2de770a4 23-Nov-2009 Theodore Ts'o <tytso@mit.edu>

ext4: fix potential buffer head leak when add_dirent_to_buf() returns ENOSPC

Previously add_dirent_to_buf() did not free its passed-in buffer head
in the case of ENOSPC, since in some cases the caller still needed it.
However, this led to potential buffer head leaks since not all callers
dealt with this correctly. Fix this by making simplifying the freeing
convention; now add_dirent_to_buf() *never* frees the passed-in buffer
head, and leaves that to the responsibility of its caller. This makes
things cleaner and easier to prove that the code is neither leaking
buffer heads or calling brelse() one time too many.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Curt Wohlgemuth <curtw@google.com>
Cc: stable@kernel.org


# 1e424a348 08-Nov-2009 Theodore Ts'o <tytso@mit.edu>

ext4: partial revert to fix double brelse WARNING()

This is a partial revert of commit 6487a9d (only the changes made to
fs/ext4/namei.c), since it is causing the following brelse()
double-free warning when running fsstress on a file system with 1k
blocksize and we run into a block allocation failure while converting
a single-block directory to a multi-block hash-tree indexed directory.

WARNING: at fs/buffer.c:1197 __brelse+0x2e/0x33()
Hardware name:
VFS: brelse: Trying to free free buffer
Modules linked in:
Pid: 2226, comm: jbd2/sdd-8 Not tainted 2.6.32-rc6-00577-g0003f55 #101
Call Trace:
[<c01587fb>] warn_slowpath_common+0x65/0x95
[<c0158869>] warn_slowpath_fmt+0x29/0x2c
[<c021168e>] __brelse+0x2e/0x33
[<c0288a9f>] jbd2_journal_refile_buffer+0x67/0x6c
[<c028a9ed>] jbd2_journal_commit_transaction+0x319/0x14d8
[<c0164d73>] ? try_to_del_timer_sync+0x58/0x60
[<c0175bcc>] ? sched_clock_cpu+0x12a/0x13e
[<c017f6b4>] ? trace_hardirqs_off+0xb/0xd
[<c0175c1f>] ? cpu_clock+0x3f/0x5b
[<c017f6ec>] ? lock_release_holdtime+0x36/0x137
[<c0664ad0>] ? _spin_unlock_irqrestore+0x44/0x51
[<c0180af3>] ? trace_hardirqs_on_caller+0x103/0x124
[<c0180b1f>] ? trace_hardirqs_on+0xb/0xd
[<c0164d73>] ? try_to_del_timer_sync+0x58/0x60
[<c0290d1c>] kjournald2+0x11a/0x310
[<c017118e>] ? autoremove_wake_function+0x0/0x38
[<c0290c02>] ? kjournald2+0x0/0x310
[<c0170ee6>] kthread+0x66/0x6b
[<c0170e80>] ? kthread+0x0/0x6b
[<c01251b3>] kernel_thread_helper+0x7/0x10
---[ end trace 5579351b86af61e3 ]---

Commit 6487a9d was an attempt some buffer head leaks in an ENOSPC
error path, but in some cases it actually results in an excess ENOSPC,
as shown above. Fixing this means cleaning up who is responsible for
releasing the buffer heads from the callee to the caller of
add_dirent_to_buf().

Since that's a relatively complex change, and we're late in the rcX
development cycle, I'm reverting this now, and holding back a more
complete fix until after 2.6.32 ships. We've lived with this
buffer_head leak on ENOSPC in ext3 and ext4 for a very long time; a
few more months won't kill us.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Curt Wohlgemuth <curtw@google.com>


# d3d1faf6 29-Sep-2009 Curt Wohlgemuth <curtw@google.com>

ext4: Handle nested ext4_journal_start/stop calls without a journal

This patch fixes a problem with handling nested calls to
ext4_journal_start/ext4_journal_stop, when there is no journal present.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1f7bebb9 10-Sep-2009 Andreas Schlick <schlick@lavabit.com>

ext4: Always set dx_node's fake_dirent explicitly.

When ext4_dx_add_entry() has to split an index node, it has to ensure that
name_len of dx_node's fake_dirent is also zero, because otherwise e2fsck
won't recognise it as an intermediate htree node and consider the htree to
be corrupted.

Signed-off-by: Andreas Schlick <schlick@lavabit.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1d5ccd1c 28-Aug-2009 Linus Torvalds <torvalds@linux-foundation.org>

ext[234]: move over to 'check_acl' permission model

Don't implement per-filesystem 'extX_permission()' functions that have
to be called for every path component operation, and instead just expose
the actual ACL checking so that the VFS layer can now do it for us.

Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b05ab1dc 29-Aug-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Limit number of links that can be created by ext4_link()

In ext4_link we need to check using EXT4_LINK_MAX, and not
EXT4_DIR_LINK_MAX(), since ext4_link() is creating hard links of
regular files, and not directories.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2c94eb86 28-Aug-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Allow rename to create more than EXT4_LINK_MAX subdirectories

Use EXT4_DIR_LINK_MAX so that rename() can move a directory into new
parent directory without running into the EXT4_LINK_MAX limit.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6487a9d3 17-Jul-2009 Curt Wohlgemuth <curtw@google.com>

ext4: More buffer head reference leaks

After the patch I posted last week regarding buffer head ref leaks in
no-journal mode, I looked at all the code that uses buffer heads and
searched for more potential leaks.

The patch below fixes the issues I found; these can occur even when a
journal is present.

The change to inode.c fixes a double release if
ext4_journal_get_create_access() fails.

The changes to namei.c are more complicated. add_dirent_to_buf() will
release the input buffer head EXCEPT when it returns -ENOSPC. There are
some callers of this routine that don't always do the brelse() in the event
that -ENOSPC is returned. Unfortunately, to put this fix into ext4_add_entry()
required capturing the return value of make_indexed_dir() and
add_dirent_to_buf().

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 11013911 13-Jun-2009 Andreas Dilger <adilger@sun.com>

ext4: teach the inode allocator to use a goal inode number

Enhance the inode allocator to take a goal inode number as a
paremeter; if it is specified, it takes precedence over Orlov or
parent directory inode allocation algorithms.

The extents migration function uses the goal inode number so that the
extent trees allocated the migration function use the correct flex_bg.
In the future, the goal inode functionality will also be used to
allocate an adjacent inode for the extended attributes.

Also, for testing purposes the goal inode number can be specified via
/sys/fs/{dev}/inode_goal. This can be useful for testing inode
allocation beyond 2^32 blocks on very large filesystems.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f157a4aa 13-Jun-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Use a hash of the topdir directory name for the Orlov parent group

Instead of using a random number to determine the goal parent grop for
the Orlov top directories, use a hash of the directory name. This
allows for repeatable results when trying to benchmark filesystem
layout algorithms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9aee2286 07-Jun-2009 Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

ext4: fix dx_map_entry to support 256k directory blocks

The dx_map_entry structure doesn't support over 64KB block size by
current usage of its member("offs"). Because "offs" treats an offset
of copies of the ext4_dir_entry_2 structure as is. This member size is
16 bits. But real offset for over 64KB(256KB) block size needs 18
bits. However, real offset keeps 4 byte boundary, so lower 2 bits is
not used.

Therefore, we do the following to fix this limitation:
For "store":
we divide the real offset by 4 and then store this result to "offs"
member.
For "use":
we multiply "offs" member by 4 and then use this result
as real offset.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# abc8746e 02-May-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: hook fiemap operation for directories

Add fiemap callback for directories

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 596397b7 01-May-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Move fs/ext4/namei.h into ext4.h

The fs/ext4/namei.h header file had only a single function
declaration, and should have never been a standalone file. Move it
into ext4.h, where should have been from the beginning.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3b9d4ed2 25-Apr-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Replace lock/unlock_super() with an explicit lock for the orphan list

Use a separate lock to protect the orphan list, so we can stop
overloading the use of lock_super().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a269eb18 26-Jan-2009 Jan Kara <jack@suse.cz>

ext4: Use lowercase names of quota functions

Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mingming Cao <cmm@us.ibm.com>
CC: linux-ext4@vger.kernel.org


# afd4672d 16-Mar-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Add auto_da_alloc mount option

Add a mount option which allows the user to disable automatic
allocation of blocks whose allocation by delayed allocation when the
file was originally truncated or when the file is renamed over an
existing file. This feature is intended to save users from the
effects of naive application writers, but it reduces the effectiveness
of the delayed allocation code. This mount option disables this
safety feature, which may be desirable for prodcutions systems where
the risk of unclean shutdowns or unexpected system crashes is low.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 8750c6d5 23-Feb-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Automatically allocate delay allocated blocks on rename

When renaming a file such that a link to another inode is overwritten,
force any delay allocated blocks that to be allocated so that if the
filesystem is mounted with data=ordered, the data blocks will be
pushed out to disk along with the journal commit. Many application
programs expect this, so we do this to avoid zero length files if the
system crashes unexpectedly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e6f009b0 22-Feb-2009 Bryan Donlan <bdonlan@gmail.com>

ext4: return -EIO not -ESTALE on directory traversal through deleted inode

ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext4_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted
such that a directory entry references a deleted inode. This leads to
a misleading error message - "Stale NFS file handle" - and confusion
on the part of the admin.

The bug can be easily reproduced by creating a new filesystem, making
a link to an unused inode using debugfs, then mounting and attempting
to ls -l said link.

This patch thus changes ext4_lookup to return -EIO if it receives
-ESTALE from ext4_iget(), as ext4 does for other filesystem metadata
corruption; and also invokes the appropriate ext*_error functions when
this case is detected.

Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3d0518f4 14-Feb-2009 Wei Yongjun <yjwei@cn.fujitsu.com>

ext4: New rec_len encoding for very large blocksizes

The rec_len field in the directory entry is 16 bits, so to encode
blocksizes larger than 64k becomes problematic. This patch allows us
to supprot block sizes up to 256k, by using the low 2 bits to extend
the range of rec_len to 2**18-1 (since valid rec_len sizes must be a
multiple of 4). We use the convention that a rec_len of 0 or 65535
means the filesystem block size, for compatibility with older kernels.

It's unlikely we'll see VM pages of up to 256k, but at some point we
might find that the Linux VM has been enhanced to support filesystem
block sizes > than the VM page size, at which point it might be useful
for some applications to allow very large filesystem block sizes.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 8bad4597 14-Feb-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Use unsigned int for blocksize in dx_make_map() and dx_pack_dirents()

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e6b8bc09 16-Jan-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Add sanity check to make_indexed_dir

Make sure the rec_len field in the '..' entry is sane, lest we overrun
the directory block and cause a kernel oops on a purposefully
corrupted filesystem.

Thanks to Sami Liedes for reporting this bug.

http://bugzilla.kernel.org/show_bug.cgi?id=12430

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 97e133b4 07-Jan-2009 Wu Fengguang <fengguang.wu@intel.com>

generic swap(): ext4: remove local swap() macro

Use the new generic implementation.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 54566b2c 04-Jan-2009 Nick Piggin <npiggin@suse.de>

fs: symlink write_begin allocation context fix

With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.

The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.

Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).

This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).

[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b38e842 30-Dec-2008 Al Viro <viro@zeniv.linux.org.uk>

nfsd race fixes: ext4

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 498e5f24 04-Nov-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Change unsigned long to unsigned int

Convert the unsigned longs that are most responsible for bloating the
stack usage on 64-bit systems.

Nearly all places in the ext3/4 code which uses "unsigned long" is
probably a bug, since on 32-bit systems a ulong a 32-bits, which means
we are wasting stack space on 64-bit systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0390131b 06-Jan-2009 Frank Mayhar <fmayhar@google.com>

ext4: Allow ext4 to run without a journal

A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal. Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs. We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal. It just so happens that running without a journal is
slightly faster for most everything.

We did
iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache. Results:

ext2 ext4, default ext4, no journal
initial writes 13.0 MB/s 15.4 MB/s 15.7 MB/s
rewrites 13.1 MB/s 15.6 MB/s 15.9 MB/s
reads 15.2 MB/s 16.9 MB/s 17.2 MB/s
re-reads 15.3 MB/s 16.9 MB/s 17.2 MB/s
random readers 5.6 MB/s 5.6 MB/s 5.7 MB/s
random writers 5.1 MB/s 5.3 MB/s 5.4 MB/s

So it seems that, so far, this was a useful exercise.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 59e315b4 06-Dec-2008 Theodore Ts'o <tytso@mit.edu>

ext3/4: Fix loop index in do_split() so it is signed

This fixes a gcc warning but it doesn't appear able to result in a
failure, since the primary way the loop is exited is the first
conditional in the for loop, and at least for a consistent filesystem,
the signed/unsigned should in practice never be exposed.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f99b2589 28-Oct-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Add support for non-native signed/unsigned htree hash algorithms

The original ext3 hash algorithms assumed that variables of type char
were signed, as God and K&R intended. Unfortunately, this assumption
is not true on some architectures. Userspace support for marking
filesystems with non-native signed/unsigned chars was added two years
ago, but the kernel-side support was never added (until now).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3856d30d 23-Oct-2008 Christoph Hellwig <hch@lst.de>

ext4: remove unused variable in ext4_get_parent

Signed-off-by: Christoph Hellwig <hch@lst.de>
[ All users removed in "switch all filesystems over to d_obtain_alias",
aka commit 440037287c5ebb07033ab927ca16bb68c291d309 ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 44003728 11-Aug-2008 Christoph Hellwig <hch@lst.de>

[PATCH] switch all filesystems over to d_obtain_alias

Switch all users of d_alloc_anon to d_obtain_alias.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 03010a33 10-Oct-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Rename ext4dev to ext4

The ext4 filesystem is getting stable enough that it's time to drop
the "dev" prefix. Also remove the requirement for the TEST_FILESYS
flag.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f702ba0f 22-Sep-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Don't use 'struct dentry' for internal lookups

This is a port of a patch from Linus which fixes a 200+ byte stack
usage problem in ext4_get_parent().

It's more efficient to pass down only the actual parts of the dentry
that matter: the parent inode and the name, instead of allocating a
struct dentry on the stack.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# af5bc92d 08-Sep-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Fix whitespace checkpatch warnings/errors

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4776004f 08-Sep-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Add printk priority levels to clean up checkpatch warnings

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# d9c769b7 11-Jul-2008 Li Zefan <lizf@cn.fujitsu.com>

ext4: cleanup never-used magic numbers from htree code

dx_root_limit() will had some dead code which forced it to always return
20, and dx_node_limit to always return 22 for debugging purposes.
Remove it.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f795e140 11-Jul-2008 Li Zefan <lizf@cn.fujitsu.com>

ext4: fix build failure if DX_DEBUG is enabled

ext4_next_entry() is used by the debugging function dx_show_leaf(), so
it must be defined before that function.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f3b35f06 11-Jul-2008 Duane Griffin <duaneg@dghda.com>

ext4: validate directory entry data before use

ext4_dx_find_entry uses ext4_next_entry without verifying that the entry is
valid. If its rec_len == 0 this causes an infinite loop. Refactor the loop
to check the validity of entries before checking whether they match and
moving onto the next one.

There are other uses of ext4_next_entry in this file which also look
problematic. They should be reviewed and fixed if/when we have a test-case
that triggers them.

This patch fixes the first case (image hdb.25.softlockup.gz) reported in
http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3dcf5451 29-Apr-2008 Christoph Hellwig <hch@lst.de>

ext4: move headers out of include/linux

Move ext4 headers out of include/linux. This is just the trivial move,
there's some more thing that could be done later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 53b7e9f6 29-Apr-2008 Jan Kara <jack@suse.cz>

ext4: Fix update of mtime and ctime on rename

The patch below makes ext4 update mtime and ctime of the directory
into which we move file even if the directory entry already exists.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 46e665e9 17-Apr-2008 Harvey Harrison <harvey.harrison@gmail.com>

ext4: replace remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a871611b 17-Apr-2008 Akinobu Mita <akinobu.mita@gmail.com>

ext4: check ext4_journal_get_write_access() errors

Check ext4_journal_get_write_access() errors.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: adilger@clusterfs.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mingming Cao <cmm@us.ibm.com>


# e65187e6 29-Apr-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Enable extent format for symlinks.

This patch enables extent-formatted normal symlinks. Using extents
format allows a symlink to refer to a block number larger than 2^32
on large filesystems. We still don't enable extent format for fast
symlinks, which are contained in the inode itself.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 42bf0383 25-Feb-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: set EXT4_EXTENTS_FL only for directory and regular files

In addition, don't inherit EXT4_EXTENTS_FL from parent directory.
If we have a directory with extent flag set and later mount the file
system with -o noextents, the files created in that directory will also
have extent flag set but we would not have called ext4_ext_tree_init for
them. This will cause error later when we are verifying the extent header

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 825f1481 15-Feb-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Don't use ext4_dec_count() if not needed

The ext4_dec_count() function is only needed when dropping the i_nlink
count on inodes which are (or which could be) directories. If we
*know* that the inode in question can't possibly be a directory, use
drop_nlink or clear_nlink() if we know i_nlink is 1.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4cdeed86 22-Feb-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Don't leave behind a half-created inode if ext4_mkdir() fails

If ext4_mkdir() fails to allocate the initial block for the directory,
don't leave behind a half-created directory inode with the link count
left at one. This was caused by an inappropriate call to ext4_dec_count().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1d1fe1ee 07-Feb-2008 David Howells <dhowells@redhat.com>

iget: stop EXT4 from using iget() and read_inode()

Stop the EXT4 filesystem from using iget() and read_inode(). Replace
ext4_read_inode() with ext4_iget(), and call that instead of iget().
ext4_iget() then uses iget_locked() directly and returns a proper error code
instead of an inode in the event of an error.

ext4_fill_super() returns any error incurred when getting the root inode
instead of EINVAL.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b8356c46 05-Feb-2008 Valerie Clement <valerie.clement@bull.net>

ext4: Don't set EXTENTS_FL flag for fast symlinks

For fast symbolic links, the file content is stored in the i_block[]
array, which is not compatible with the new file extents format.
e2fsck reports error on such files because EXTENTS_FL is set.
Don't set the EXTENTS_FL flag when creating fast symlinks.

In the case of file migration, skip fast symbolic links.

Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 01f4adc0 28-Jan-2008 Mariusz Kozlowski <m.kozlowski@tuxland.pl>

ext4: remove unused code from ext4_find_entry()

The unused code found in ext3_find_entry() is also present (and still
unused) in the ext4_find_entry() code. This patch removes it.

Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 725d26d3 28-Jan-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Introduce ext4_lblk_t

This patch adds a new data type ext4_lblk_t to represent
the logical file blocks.

This is the preparatory patch to support large files in ext4
The follow up patch with convert the ext4_inode i_blocks to
represent the number of blocks in file system block size. This
changes makes it possible to have a block number 2**32 -1 which
will result in overflow if the block number is represented by
signed long. This patch convert all the block number to type
ext4_lblk_t which is typedef to __u32

Also remove dead code ext4_ext_walk_space

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>


# a72d7f83 28-Jan-2008 Jan Kara <jack@suse.cz>

ext4: Avoid rec_len overflow with 64KB block size

With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0xffff instead and convert
value when read from / written to disk. The patch also converts some places
to use ext4_next_entry() when we are changing them anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>


# 4074fe37 16-Oct-2007 Eric Sandeen <sandeen@redhat.com>

ext4: remove #ifdef CONFIG_EXT4_INDEX

CONFIG_EXT4_INDEX is not an exposed config option in the kernel, and it is
unconditionally defined in ext4_fs.h. tune2fs is already able to turn off
dir indexing, so at this point it's just cluttering up the code. Remove
it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ef2b02d3 18-Sep-2007 Eric Sandeen <sandeen@redhat.com>

ext34: ensure do_split leaves enough free space in both blocks

The do_split() function for htree dir blocks is intended to split a leaf
block to make room for a new entry. It sorts the entries in the original
block by hash value, then moves the last half of the entries to the new
block - without accounting for how much space this actually moves. (IOW,
it moves half of the entry *count* not half of the entry *space*). If by
chance we have both large & small entries, and we move only the smallest
entries, and we have a large new entry to insert, we may not have created
enough space for it.

The patch below stores each record size when calculating the dx_map, and
then walks the hash-sorted dx_map, calculating how many entries must be
moved to more evenly split the existing entries between the old block and
the new block, guaranteeing enough space for the new entry.

The dx_map "offs" member is reduced to u16 so that the overall map size
does not change - it is temporarily stored at the end of the new block, and
if it grows too large it may be overwritten. By making offs and size both
u16, we won't grow the map size.

Also add a few comments to the functions involved.

This fixes the testcase reported by hooanon05@yahoo.co.jp on the
linux-ext4 list, "ext3 dir_index causes an error"

Thanks to Andreas Dilger for discussing the problem & solution with me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Tested-by: Junjiro Okajima <hooanon05@yahoo.co.jp>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <linux-ext4@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d82abae 18-Sep-2007 Eric Sandeen <sandeen@redhat.com>

dir_index: error out instead of BUG on corrupt dx dirs

Convert asserts (BUGs) in dx_probe from bad on-disk data to recoverable
errors with helpful warnings. With help catching other asserts from Duane
Griffin <duaneg@dghda.com>

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f8628a14 18-Jul-2007 Andreas Dilger <adilger@clusterfs.com>

ext4: Remove 65000 subdirectory limit

This patch adds support to ext4 for allowing more than 65000
subdirectories. Currently the maximum number of subdirectories is capped
at 32000.

If we exceed 65000 subdirectories in an htree directory it sets the
inode link count to 1 and no longer counts subdirectories. The
directory link count is not actually used when determining if a
directory is empty, as that only counts subdirectories and not regular
files that might be in there.

A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
the subdir count for any directory crosses 65000. A later fsck will clear
EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any directory
with >65000 subdirs.

Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ef7f3835 18-Jul-2007 Kalpak Shah <kalpak@clusterfs.com>

ext4: Add nanosecond timestamps

This patch adds nanosecond timestamps for ext4. This involves adding
*time_extra fields to the ext4_inode to extend the timestamps to
64-bits. Creation time is also added by this patch.

These extended fields will fit into an inode if the filesystem was
formatted with large inodes (-I 256 or larger) and there are currently
no EAs consuming all of the available space. For new inodes we always
reserve enough space for the kernel's known extended fields, but for
inodes created with an old kernel this might not have been the case. So
this patch also adds the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature
flag(ro-compat so that older kernels can't create inodes with a smaller
extra_isize). which indicates if the fields fitting inside
s_min_extra_isize are available or not. If the expansion of inodes if
unsuccessful then this feature will be disabled. This feature is only
enabled if requested by the sysadmin.

None of the extended inode fields is critical for correct filesystem
operation.

Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a6c15c2b 16-Jul-2007 Vasily Averin <vvs@sw.ru>

ext3/ext4: orphan list corruption due bad inode

After ext3 orphan list check has been added into ext3_destroy_inode()
(please see my previous patch) the following situation has been detected:

EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file (37901290), 0
Inode 00000101a15b7840: orphan list check failed!
00000773 6f665f00 74616d72 00000573 65725f00 06737270 66000000 616d726f
...
Call Trace: [<ffffffff80211ea9>] ext3_destroy_inode+0x79/0x90
[<ffffffff801a2b16>] sys_unlink+0x126/0x1a0
[<ffffffff80111479>] error_exit+0x0/0x81
[<ffffffff80110aba>] system_call+0x7e/0x83

First messages said that unlinked inode has i_nlink=0, then ext3_unlink()
adds this inode into orphan list.

Second message means that this inode has not been removed from orphan list.
Inode dump has showed that i_fop = &bad_file_ops and it can be set in
make_bad_inode() only. Then I've found that ext3_read_inode() can call
make_bad_inode() without any error/warning messages, for example in the
following case:

...
if (inode->i_nlink == 0) {
if (inode->i_mode == 0 ||
!(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
/* this inode is deleted */
brelse (bh);
goto bad_inode;
...

Bad inode can live some time, ext3_unlink can add it to orphan list, but
ext3_delete_inode() do not deleted this inode from orphan list. As result
we can have orphan list corruption detected in ext3_destroy_inode().

However it is not clear for me how to fix this issue correctly.

As far as i see is_bad_inode() is called after iget() in all places
excluding ext3_lookup() and ext3_get_parent(). I believe it makes sense to
add bad inode check to these functions too and call iput if bad inode
detected.

Signed-off-by: Vasily Averin <vvs@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8c55e204 24-May-2007 Dave Kleikamp <shaggy@austin.ibm.com>

EXT4: Fix whitespace

Replace a lot of spaces with tabs

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e63340ae 08-May-2007 Randy Dunlap <randy.dunlap@oracle.com>

header cleaning: don't include smp_lock.h when not used

Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fedee54d 08-May-2007 Dmitriy Monakhov <dmonakhov@sw.ru>

ext3: dirindex error pointer issues

- ext3_dx_find_entry() exit with out setting proper error pointer

- do_split() exit with out setting proper error pointer
it is realy painful because many callers contain folowing code:

de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
if (!(de))
return retval;
<<< WOW retval wasn't changed by do_split(), so caller failed
<<< but return SUCCESS :)

- Rearrange do_split() error path. Current error path is realy ugly, all
this up and down jump stuff doesn't make code easy to understand.

[dmonakhov@sw.ru: fix annoying fake error messages]
Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 754661f1 12-Feb-2007 Arjan van de Ven <arjan@linux.intel.com>

[PATCH] mark struct inode_operations const 1

Many struct inode_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 731b9a54 10-Feb-2007 Eric Sandeen <sandeen@redhat.com>

[PATCH] remove ext[34]_inc_count and _dec_count

- Naming is confusing, ext3_inc_count manipulates i_nlink not i_count
- handle argument passed in is not used
- ext3 and ext4 already call inc_nlink and dec_nlink directly in other places

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2988a774 10-Feb-2007 Eric Sandeen <sandeen@redhat.com>

[PATCH] return ENOENT from ext3_link when racing with unlink

Return -ENOENT from ext[34]_link if we've raced with unlink and i_nlink is
0. Doing otherwise has the potential to corrupt the orphan inode list,
because we'd wind up with an inode with a non-zero link count on the list,
and it will never get properly cleaned up & removed from the orphan list
before it is freed.

[akpm@osdl.org: build fix]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9d549890 08-Dec-2006 Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>

[PATCH] ext4: change uses of f_{dentry, vfsmnt} to use f_path

Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in the ext4
filesystem.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e6c40211 06-Dec-2006 Eric Sandeen <sandeen@redhat.com>

[PATCH] handle ext4 directory corruption better

I've been using Steve Grubb's purely evil "fsfuzzer" tool, at
http://people.redhat.com/sgrubb/files/fsfuzzer-0.4.tar.gz

Basically it makes a filesystem, splats some random bits over it, then
tries to mount it and do some simple filesystem actions.

At best, the filesystem catches the corruption gracefully. At worst,
things spin out of control.

As you might guess, we found a couple places in ext4 where things spin out
of control :)

First, we had a corrupted directory that was never checked for
consistency... it was corrupt, and pointed to another bad "entry" of
length 0. The for() loop looped forever, since the length of
ext4_next_entry(de) was 0, and we kept looking at the same pointer over and
over and over and over... I modeled this check and subsequent action on
what is done for other directory types in ext4_readdir...

(adding this check adds some computational expense; I am testing a followup
patch to reduce the number of times we check and re-check these directory
entries, in all cases. Thanks for the idea, Andreas).

Next we had a root directory inode which had a corrupted size, claimed to
be > 200M on a 4M filesystem. There was only really 1 block in the
directory, but because the size was so large, readdir kept coming back for
more, spewing thousands of printk's along the way.

Per Andreas' suggestion, if we're in this read error condition and we're
trying to read an offset which is greater than i_blocks worth of bytes,
stop trying, and break out of the loop.

With these two changes fsfuzz test survives quite well on ext4.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 63f57933 11-Oct-2006 Andrew Morton <akpm@osdl.org>

[PATCH] ext4 whitespace cleanups

Someone's tab key is emitting spaces. Attempt to repair some of the damage.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# dab291af 11-Oct-2006 Mingming Cao <cmm@us.ibm.com>

[PATCH] jbd2: enable building of jbd2 and have ext4 use it rather than jbd

Reworked from a patch by Mingming Cao and Randy Dunlap

Signed-off-By: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 617ba13b 11-Oct-2006 Mingming Cao <cmm@us.ibm.com>

[PATCH] ext4: rename ext4 symbols to avoid duplication of ext3 symbols

Mingming Cao originally did this work, and Shaggy reproduced it using some
scripts from her.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ac27a0ec 11-Oct-2006 Dave Kleikamp <shaggy@austin.ibm.com>

[PATCH] ext4: initial copy of files from ext3

Start of the ext4 patch series. See Documentation/filesystems/ext4.txt for
details.

This is a simple copy of the files in fs/ext3 to fs/ext4 and
/usr/incude/linux/ext3* to /usr/include/ex4*

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>