History log of /linux-master/fs/ext4/inline.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# a4fc4a0c 07-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: add folio_zero_tail() and use it in ext4

Patch series "Add folio_zero_tail() and folio_fill_tail()".

I'm trying to make it easier for filesystems with tailpacking / stuffing /
inline data to use folios. The primary function here is
folio_fill_tail(). You give it a pointer to memory where the data
currently is, and it takes care of copying it into the folio at that
offset. That works for gfs2 & iomap. Then There's Ext4. Rather than gin
up some kind of specialist "Here's a two pointers to two blocks of memory"
routine, just let it do its current thing, and let it call
folio_zero_tail(), which is also called by folio_fill_tail().

Other filesystems can be converted later; these ones seemed like good
examples as they're already partly or completely converted to folios.


This patch (of 3):

Instead of unmapping the folio after copying the data to it, then mapping
it again to zero the tail, provide folio_zero_tail() to zero the tail of
an already-mapped folio.

[akpm@linux-foundation.org: fix kerneldoc argument ordering]
Link: https://lkml.kernel.org/r/20231107212643.3490372-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231107212643.3490372-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# b898ab23 04-Oct-2023 Jeff Layton <jlayton@kernel.org>

ext4: convert to new timestamp accessors

Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-33-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>


# eb8ab444 16-Jun-2023 Jan Kara <jack@suse.cz>

ext4: make ext4_forced_shutdown() take struct super_block

Currently ext4_forced_shutdown() takes struct ext4_sb_info but most
callers need to get it from struct super_block anyway. So just pass in
struct super_block to save all callers from some boilerplate code. No
functional changes.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616165109.21695-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1bc33893 05-Jul-2023 Jeff Layton <jlayton@kernel.org>

ext4: convert to ctime accessor functions

In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-40-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# ed5d285b 23-Apr-2023 Baokun Li <libaokun1@huawei.com>

ext4: make ext4_es_remove_extent() return void

Now ext4_es_remove_extent() never fails, so make it return void.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d19500da 15-May-2023 Ritesh Harjani <ritesh.list@gmail.com>

ext4: Make ext4_write_inline_data_end() use folio

ext4_write_inline_data_end() is completely converted to work with folio.
Also all callers of ext4_write_inline_data_end() already works on folio
except ext4_da_write_end(). Mostly for consistency and saving few
instructions maybe, this patch just converts ext4_da_write_end() to work
with folio which makes the last caller of ext4_write_inline_data_end()
also converted to work with folio.
We then make ext4_write_inline_data_end() take folio instead of page.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/1bcea771720ff451a5a59b3f1bcd5fae51cb7ce7.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 0b956de1 15-May-2023 Ritesh Harjani <ritesh.list@gmail.com>

ext4: kill unused function ext4_journalled_write_inline_data

Commit 3f079114bf522 ("ext4: Convert data=journal writeback to use ext4_writepages()")
Added support for writeback of journalled data into ext4_writepages()
and killed function __ext4_journalled_writepage() which used to call
ext4_journalled_write_inline_data() for inline data.
This function got left over by mistake. Hence kill it's definition as
no one uses it.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/122b2a8d5e0650686f23ed6da26ed9e04105562b.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2a534e1d 12-May-2023 Theodore Ts'o <tytso@mit.edu>

ext4: bail out of ext4_xattr_ibody_get() fails for any reason

In ext4_update_inline_data(), if ext4_xattr_ibody_get() fails for any
reason, it's best if we just fail as opposed to stumbling on,
especially if the failure is EFSCORRUPTED.

Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2220eaf9 12-May-2023 Theodore Ts'o <tytso@mit.edu>

ext4: add bounds checking in get_max_inline_xattr_value_size()

Normally the extended attributes in the inode body would have been
checked when the inode is first opened, but if someone is writing to
the block device while the file system is mounted, it's possible for
the inode table to get corrupted. Add bounds checking to avoid
reading beyond the end of allocated memory if this happens.

Reported-by: syzbot+1966db24521e5f6e23f7@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=1966db24521e5f6e23f7
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f4ce24f5 06-May-2023 Theodore Ts'o <tytso@mit.edu>

ext4: fix deadlock when converting an inline directory in nojournal mode

In no journal mode, ext4_finish_convert_inline_dir() can self-deadlock
by calling ext4_handle_dirty_dirblock() when it already has taken the
directory lock. There is a similar self-deadlock in
ext4_incvert_inline_data_nolock() for data files which we'll fix at
the same time.

A simple reproducer demonstrating the problem:

mke2fs -Fq -t ext2 -O inline_data -b 4k /dev/vdc 64
mount -t ext4 -o dirsync /dev/vdc /vdc
cd /vdc
mkdir file0
cd file0
touch file0
touch file1
attr -s BurnSpaceInEA -V abcde .
touch supercalifragilisticexpialidocious

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230507021608.1290720-1-tytso@mit.edu
Reported-by: syzbot+91dccab7c64e2850a4e5@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=ba84cc80a9491d65416bc7877e1650c87530fe8a
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7fa8a8ee 27-Apr-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.

- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.

- zsmalloc performance improvements from Sergey Senozhatsky.

- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.

- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful

- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.

- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.

- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).

- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.

- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.

- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.

- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.

- Cleanups to do_fault_around() by Lorenzo Stoakes.

- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.

- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.

- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.

- Lorenzo has also contributed further cleanups of vma_merge().

- Chaitanya Prakash provides some fixes to the mmap selftesting code.

- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.

- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.

- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.

- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.

- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.

- Yosry Ahmed has improved the performance of memcg statistics
flushing.

- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.

- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.

- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.

- Pankaj Raghav has made page_endio() unneeded and has removed it.

- Peter Xu contributed some rationalizations of the userfaultfd
selftests.

- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.

- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.

- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.

- Peter Xu fixes a few issues with uffd-wp at fork() time.

- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...


# 6b90d413 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_write_inline_data_end() to use a folio

Convert the incoming page to a folio so that we call compound_head()
only once instead of seven times.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230324180129.1220691-16-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 6b87fbe4 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_read_inline_page() to ext4_read_inline_folio()

All callers now have a folio, so pass it and use it. The folio may
be large, although I doubt we'll want to use a large folio for an
inline file.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230324180129.1220691-15-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 9a9d01f0 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_da_write_inline_data_begin() to use a folio

Saves a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230324180129.1220691-14-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 4ed9b598 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_da_convert_inline_data_to_extent() to use a folio

Saves a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230324180129.1220691-13-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# f8f8c89f 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_try_to_write_inline_data() to use a folio

Saves a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230324180129.1220691-12-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 83eba701 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_convert_inline_data_to_extent() to use a folio

Saves a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230324180129.1220691-11-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 3edde93e 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_readpage_inline() to take a folio

Use the folio API in this function, saves a few calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230324180129.1220691-10-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 1dcdce59 06-Mar-2023 Ye Bin <yebin10@huawei.com>

ext4: move where set the MAY_INLINE_DATA flag is set

The only caller of ext4_find_inline_data_nolock() that needs setting of
EXT4_STATE_MAY_INLINE_DATA flag is ext4_iget_extra_inode(). In
ext4_write_inline_data_end() we just need to update inode->i_inline_off.
Since we are going to add one more caller that does not need to set
EXT4_STATE_MAY_INLINE_DATA, just move setting of EXT4_STATE_MAY_INLINE_DATA
out to ext4_iget_extra_inode().

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230307015253.2232062-2-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 66267814 16-Aug-2022 Jiangshan Yi <yijiangshan@kylinos.cn>

fs/ext4: replace ternary operator with min()/max() and min_t()

Fix the following coccicheck warning:

fs/ext4/inline.c:183: WARNING opportunity for min().
fs/ext4/extents.c:2631: WARNING opportunity for max().
fs/ext4/extents.c:2632: WARNING opportunity for min().
fs/ext4/extents.c:5559: WARNING opportunity for max().
fs/ext4/super.c:6908: WARNING opportunity for min().

min()/max() and min_t() macro is defined in include/linux/minmax.h.
It avoids multiple evaluations of the arguments when non-constant and
performs strict type-checking.

Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220817025928.612851-1-13667453960@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# c9fd167d 15-Jun-2022 Baokun Li <libaokun1@huawei.com>

ext4: correct max_inline_xattr_value_size computing

If the ext4 inode does not have xattr space, 0 is returned in the
get_max_inline_xattr_value_size function. Otherwise, the function returns
a negative value when the inode does not contain EXT4_STATE_XATTR.

Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 5a57bca9 30-Jun-2022 Zhang Yi <yi.zhang@huawei.com>

ext4: fix reading leftover inlined symlinks

Since commit 6493792d3299 ("ext4: convert symlink external data block
mapping to bdev"), create new symlink with inline_data is not supported,
but it missing to handle the leftover inlined symlinks, which could
cause below error message and fail to read symlink.

ls: cannot read symbolic link 'foo': Structure needs cleaning

EXT4-fs error (device sda): ext4_map_blocks:605: inode #12: block
2021161080: comm ls: lblock 0 mapped to illegal pblock 2021161080
(length 1)

Fix this regression by adding ext4_read_inline_link(), which read the
inline data directly and convert it through a kmalloced buffer.

Fixes: 6493792d3299 ("ext4: convert symlink external data block mapping to bdev")
Cc: stable@kernel.org
Reported-by: Torge Matthies <openglfreak@googlemail.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Torge Matthies <openglfreak@googlemail.com>
Link: https://lore.kernel.org/r/20220630090100.2769490-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# fdaf9a58 24-May-2022 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache

Pull page cache updates from Matthew Wilcox:

- Appoint myself page cache maintainer

- Fix how scsicam uses the page cache

- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS

- Remove the AOP flags entirely

- Remove pagecache_write_begin() and pagecache_write_end()

- Documentation updates

- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio

- Change filler_t to require a struct file pointer be the first
argument like ->read_folio

* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
nilfs2: Fix some kernel-doc comments
Appoint myself page cache maintainer
fs: Remove aops->freepage
secretmem: Convert to free_folio
nfs: Convert to free_folio
orangefs: Convert to free_folio
fs: Add free_folio address space operation
fs: Convert drop_buffers() to use a folio
fs: Change try_to_free_buffers() to take a folio
jbd2: Convert release_buffer_page() to use a folio
jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
reiserfs: Convert release_buffer_page() to use a folio
fs: Remove last vestiges of releasepage
ubifs: Convert to release_folio
reiserfs: Convert to release_folio
orangefs: Convert to release_folio
ocfs2: Convert to release_folio
nilfs2: Remove comment about releasepage
nfs: Convert to release_folio
jfs: Convert to release_folio
...


# ef09ed5d 16-May-2022 Ye Bin <yebin10@huawei.com>

ext4: fix bug_on in ext4_writepages

we got issue as follows:
EXT4-fs error (device loop0): ext4_mb_generate_buddy:1141: group 0, block bitmap and bg descriptor inconsistent: 25 vs 31513 free cls
------------[ cut here ]------------
kernel BUG at fs/ext4/inode.c:2708!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 2 PID: 2147 Comm: rep Not tainted 5.18.0-rc2-next-20220413+ #155
RIP: 0010:ext4_writepages+0x1977/0x1c10
RSP: 0018:ffff88811d3e7880 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88811c098000
RDX: 0000000000000000 RSI: ffff88811c098000 RDI: 0000000000000002
RBP: ffff888128140f50 R08: ffffffffb1ff6387 R09: 0000000000000000
R10: 0000000000000007 R11: ffffed10250281ea R12: 0000000000000001
R13: 00000000000000a4 R14: ffff88811d3e7bb8 R15: ffff888128141028
FS: 00007f443aed9740(0000) GS:ffff8883aef00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020007200 CR3: 000000011c2a4000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
do_writepages+0x130/0x3a0
filemap_fdatawrite_wbc+0x83/0xa0
filemap_flush+0xab/0xe0
ext4_alloc_da_blocks+0x51/0x120
__ext4_ioctl+0x1534/0x3210
__x64_sys_ioctl+0x12c/0x170
do_syscall_64+0x3b/0x90

It may happen as follows:
1. write inline_data inode
vfs_write
new_sync_write
ext4_file_write_iter
ext4_buffered_write_iter
generic_perform_write
ext4_da_write_begin
ext4_da_write_inline_data_begin -> If inline data size too
small will allocate block to write, then mapping will has
dirty page
ext4_da_convert_inline_data_to_extent ->clear EXT4_STATE_MAY_INLINE_DATA
2. fallocate
do_vfs_ioctl
ioctl_preallocate
vfs_fallocate
ext4_fallocate
ext4_convert_inline_data
ext4_convert_inline_data_nolock
ext4_map_blocks -> fail will goto restore data
ext4_restore_inline_data
ext4_create_inline_data
ext4_write_inline_data
ext4_set_inode_state -> set inode EXT4_STATE_MAY_INLINE_DATA
3. writepages
__ext4_ioctl
ext4_alloc_da_blocks
filemap_flush
filemap_fdatawrite_wbc
do_writepages
ext4_writepages
if (ext4_has_inline_data(inode))
BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))

The root cause of this issue is we destory inline data until call
ext4_writepages under delay allocation mode. But there maybe already
convert from inline to extent. To solve this issue, we call
filemap_flush first..

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220516122634.1690462-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# c30365b9 01-Apr-2022 Yu Zhe <yuzhe@nfschina.com>

ext4: remove unnecessary type castings

remove unnecessary void* type castings.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Link: https://lore.kernel.org/r/20220401081321.73735-1-yuzhe@nfschina.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# b7446e7c 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove aop flags parameter from grab_cache_page_write_begin()

There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 832ee62d 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

ext4: Use scoped memory APIs in ext4_write_begin()

Instead of setting AOP_FLAG_NOFS, use memalloc_nofs_save() and
memalloc_nofs_restore() to prevent GFP_FS allocations recursing
into the filesystem with a journal already started.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>

# 36d116e9 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

ext4: Use scoped memory APIs in ext4_da_write_begin()

Instead of setting AOP_FLAG_NOFS, use memalloc_nofs_save() and
memalloc_nofs_restore() to prevent GFP_FS allocations recursing
into the filesystem with a journal already started.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>

# 7333ed35 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

ext4: Allow GFP_FS allocations in ext4_da_convert_inline_data_to_extent()

Since commit 8bc1379b82b8, the transaction is stopped before calling
ext4_da_convert_inline_data_to_extent(), which means we can do GFP_FS
allocations and recurse into the filesystem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>

# 7aab5c84 27-Feb-2022 Ye Bin <yebin10@huawei.com>

ext4: fix fs corruption when tring to remove a non-empty directory with IO error

We inject IO error when rmdir non empty direcory, then got issue as follows:
step1: mkfs.ext4 -F /dev/sda
step2: mount /dev/sda test
step3: cd test
step4: mkdir -p 1/2
step5: rmdir 1
[ 110.920551] ext4_empty_dir: inject fault
[ 110.921926] EXT4-fs warning (device sda): ext4_rmdir:3113: inode #12:
comm rmdir: empty directory '1' has too many links (3)
step6: cd ..
step7: umount test
step8: fsck.ext4 -f /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '..' in .../??? (13) has deleted/unused inode 12. Clear<y>? yes
Pass 3: Checking directory connectivity
Unconnected directory inode 13 (...)
Connect to /lost+found<y>? yes
Pass 4: Checking reference counts
Inode 13 ref count is 3, should be 2. Fix<y>? yes
Pass 5: Checking group summary information

/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 12/131072 files (0.0% non-contiguous), 26157/524288 blocks

ext4_rmdir
if (!ext4_empty_dir(inode))
goto end_rmdir;
ext4_empty_dir
bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE);
if (IS_ERR(bh))
return true;
Now if read directory block failed, 'ext4_empty_dir' will return true, assume
directory is empty. Obviously, it will lead to above issue.
To solve this issue, if read directory block failed 'ext4_empty_dir' just
return false. To avoid making things worse when file system is already
corrupted, 'ext4_empty_dir' also return false.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20220228024815.3952506-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# d8ad2ce8 06-Feb-2022 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
"Various bug fixes for ext4 fast commit and inline data handling.

Also fix regression introduced as part of moving to the new mount API"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
fs/ext4: fix comments mentioning i_mutex
ext4: fix incorrect type issue during replay_del_range
jbd2: fix kernel-doc descriptions for jbd2_journal_shrink_{scan,count}()
ext4: fix potential NULL pointer dereference in ext4_fill_super()
jbd2: refactor wait logic for transaction updates into a common function
jbd2: cleanup unused functions declarations from jbd2.h
ext4: fix error handling in ext4_fc_record_modified_inode()
ext4: remove redundant max inline_size check in ext4_da_write_inline_data_begin()
ext4: fix error handling in ext4_restore_inline_data()
ext4: fast commit may miss file actions
ext4: fast commit may not fallback for ineligible commit
ext4: modify the logic of ext4_mb_new_blocks_simple
ext4: prevent used blocks from being allocated during fast commit replay


# 09355d9d 17-Jan-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: remove redundant max inline_size check in ext4_da_write_inline_data_begin()

ext4_prepare_inline_data() already checks for ext4_get_max_inline_size()
and returns -ENOSPC. So there is no need to check it twice within
ext4_da_write_inline_data_begin(). This patch removes the extra check.

It also makes it more clean.

No functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/cdd1654128d5105550c65fd13ca5da53b2162cc4.1642416995.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 897026aa 17-Jan-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: fix error handling in ext4_restore_inline_data()

While running "./check -I 200 generic/475" it sometimes gives below
kernel BUG(). Ideally we should not call ext4_write_inline_data() if
ext4_create_inline_data() has failed.

<log snip>
[73131.453234] kernel BUG at fs/ext4/inline.c:223!

<code snip>
212 static void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
213 void *buffer, loff_t pos, unsigned int len)
214 {
<...>
223 BUG_ON(!EXT4_I(inode)->i_inline_off);
224 BUG_ON(pos + len > EXT4_I(inode)->i_inline_size);

This patch handles the error and prints out a emergency msg saying potential
data loss for the given inode (since we couldn't restore the original
inline_data due to some previous error).

[ 9571.070313] EXT4-fs (dm-0): error restoring inline_data for inode -- potential data loss! (inode 1703982, error -30)

Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/9f4cd7dfd54fa58ff27270881823d94ddf78dd07.1642416995.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

# 4034247a 14-Jan-2022 NeilBrown <neilb@suse.de>

mm: introduce memalloc_retry_wait()

Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:

- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.

Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.

It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.

This patch introduces memalloc_retry_wait() with takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.

For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffie ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops uses.

linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.

Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 0add491d 19-Aug-2021 Eric Whitney <enwlinux@gmail.com>

ext4: remove extent cache entries when truncating inline data

Conditionally remove all cached extents belonging to an inode
when truncating its inline data. It's only necessary to attempt to
remove cached extents when a conversion from inline to extent storage
has been initiated (!EXT4_STATE_MAY_INLINE_DATA). This avoids
unnecessary es lock overhead in the more common inline case.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210819144927.25163-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 11ef08c9 04-Sep-2021 Theodore Ts'o <tytso@mit.edu>

Merge branch 'delalloc-buffer-write' into dev

Fix a bug in how we update i_disksize, and the error path in
inline_data_end. Finally, drop an unnecessary creation of a journal
handle which was only needed for inline data, which can give us a
large performance gain in delayed allocation writes.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6984aef5 16-Jul-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: factor out write end code of inline file

Now that the inline_data file write end procedure are falled into the
common write end functions, it is not clear. Factor them out and do
some cleanup. This patch also drop ext4_da_write_inline_data_end()
and switch to use ext4_write_inline_data_end() instead because we also
need to do the same error processing if we failed to write data into
inline entry.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com

# 55ce2f64 16-Jul-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: correct the error path of ext4_write_inline_data_end()

Current error path of ext4_write_inline_data_end() is not correct.

Firstly, it should pass out the error value if ext4_get_inode_loc()
return fail, or else it could trigger infinite loop if we inject error
here. And then it's better to add inode to orphan list if it return fail
in ext4_journal_stop(), otherwise we could not restore inline xattr
entry after power failure. Finally, we need to reset the 'ret' value if
ext4_write_inline_data_end() return success in ext4_write_end() and
ext4_journalled_write_end(), otherwise we could not get the error return
value of ext4_journal_stop().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-3-yi.zhang@huawei.com

# 188c299e 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Support for checksumming from journal triggers

JBD2 layer support triggers which are called when journaling layer moves
buffer to a certain state. We can use the frozen trigger, which gets
called when buffer data is frozen and about to be written out to the
journal, to compute block checksums for some buffer types (similarly as
does ocfs2). This avoids unnecessary repeated recomputation of the
checksum (at the cost of larger window where memory corruption won't be
caught by checksumming) and is even necessary when there are
unsynchronized updaters of the checksummed data.

So add superblock and journal trigger type arguments to
ext4_journal_get_write_access() and ext4_journal_get_create_access() so
that frozen triggers can be set accordingly. Also add inode argument to
ext4_walk_page_buffers() and all the callbacks used with that function
for the same purpose. This patch is mostly only a change of prototype of
the above mentioned functions and a few small helpers. Real checksumming
will come later.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# a54c4613 20-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: fix race writing to an inline_data file while its xattrs are changing

The location of the system.data extended attribute can change whenever
xattr_sem is not taken. So we need to recalculate the i_inline_off
field since it mgiht have changed between ext4_write_begin() and
ext4_write_end().

This means that caching i_inline_off is probably not helpful, so in
the long run we should probably get rid of it and shrink the in-memory
ext4 inode slightly, but let's fix the race the simple way for now.

Cc: stable@kernel.org
Fixes: f19d5870cbf72 ("ext4: add normal write support for inline data")
Reported-by: syzbot+13146364637c7363a7de@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 310c097c 02-Jun-2021 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()

ext4_xattr_ibody_inline_set() & ext4_xattr_ibody_set() have the exact
same definition. Hence remove ext4_xattr_ibody_inline_set() and all
its call references. Convert the callers of it to call
ext4_xattr_ibody_set() instead.

[ Modified to preserve ext4_xattr_ibody_set() and remove
ext4_xattr_ibody_inline_set() instead. -- TYT ]

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/fd566b799bbbbe9b668eb5eecde5b5e319e3694f.1622685482.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 3088e5a5 27-Mar-2021 Bhaskar Chowdhury <unixbhaskar@gmail.com>

ext4: fix various seppling typos

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Link: https://lore.kernel.org/r/cover.1616840203.git.unixbhaskar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 471fbbea 19-Mar-2021 Daniel Rosenberg <drosen@google.com>

ext4: handle casefolding with encryption

This adds support for encryption with casefolding.

Since the name on disk is case preserving, and also encrypted, we can no
longer just recompute the hash on the fly. Additionally, to avoid
leaking extra information from the hash of the unencrypted name, we use
siphash via an fscrypt v2 policy.

The hash is stored at the end of the directory entry for all entries
inside of an encrypted and casefolded directory apart from those that
deal with '.' and '..'. This way, the change is backwards compatible
with existing ext4 filesystems.

[ Changed to advertise this feature via the file:
/sys/fs/ext4/features/encrypted_casefold -- TYT ]

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210319073414.1381041-2-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 7067b261 02-Nov-2020 Joseph Qi <joseph.qi@linux.alibaba.com>

ext4: unlock xattr_sem properly in ext4_inline_data_truncate()

It takes xattr_sem to check inline data again but without unlock it
in case not have. So unlock it before return.

Fixes: aef1c8513c1f ("ext4: let ext4_truncate handle inline data correctly")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604370542-124630-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

# b483bb77 04-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

ext4: delete duplicated words + other fixes

Delete repeated words in fs/ext4/.
{the, this, of, we, after}

Also change spelling of "xttr" in inline.c to "xattr" in 2 places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 7ca4fcba 24-Apr-2020 kyoungho koo <rnrudgh@gmail.com>

ext4: Fix comment typo "the the".

I have found double typed comments "the the". So i modified it to
one "the"

Signed-off-by: kyoungho koo <rnrudgh@gmail.com>
Link: https://lore.kernel.org/r/20200424171620.GA11943@koo-Z370-HD3
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 2fe34d29 10-Aug-2020 Kyoungho Koo <rnrudgh@gmail.com>

ext4: remove unused parameter of ext4_generic_delete_entry function

The ext4_generic_delete_entry function does not use the parameter
handle, so it can be removed.

Signed-off-by: Kyoungho Koo <rnrudgh@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200810080701.GA14160@koo-Z370-HD3
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 4209ae12 26-Apr-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: handle ext4_mark_inode_dirty errors

ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
value may lead ext4 to ignore real failures that would result in
corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
as soon as possible and return errors to the caller whenever
appropriate.

One of the possible scnearios when this bug could affected is that
while creating a new inode, its directory entry gets added
successfully but while writing the inode itself mark_inode_dirty
returns error which is ignored. This would result in inconsistency
that the directory entry points to a non-existent inode.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 54d3adbc 28-Mar-2020 Theodore Ts'o <tytso@mit.edu>

ext4: save all error info in save_error_info() and drop ext4_set_errno()

Using a separate function, ext4_set_errno() to set the errno is
problematic because it doesn't do the right thing once
s_last_error_errorcode is non-zero. It's also less racy to set all of
the error information all at once. (Also, as a bonus, it shrinks code
size slightly.)

Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
Fixes: 878520ac45f9 ("ext4: save the error code which triggered...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# d3b6f23f 28-Feb-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: move ext4_fiemap to use iomap framework

This patch moves ext4_fiemap to use iomap framework.
For xattr a new 'ext4_iomap_xattr_ops' is added.

Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/b9f45c885814fcdd0631747ff0fe08886270828c.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 8d6ce136 22-Jan-2020 Shijie Luo <luoshijie1@huawei.com>

ext4,jbd2: fix comment and code style

Fix comment and remove unneccessary blank.

Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200123064325.36358-1-luoshijie1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 878520ac 19-Nov-2019 Theodore Ts'o <tytso@mit.edu>

ext4: save the error code which triggered an ext4_error() in the superblock

This allows the cause of an ext4_error() report to be categorized
based on whether it was triggered due to an I/O error, or an memory
allocation error, or other possible causes. Most errors are caused by
a detected file system inconsistency, so the default code stored in
the superblock will be EXT4_ERR_EFSCORRUPTED.

Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 7a14826e 12-Aug-2019 Colin Ian King <colin.i.king@gmail.com>

ext4: set error return correctly when ext4_htree_store_dirent fails

Currently when the call to ext4_htree_store_dirent fails the error return
variable 'ret' is is not being set to the error code and variable count is
instead, hence the error code is not being returned. Fix this by assigning
ret to the error return code.

Addresses-Coverity: ("Unused value")
Fixes: 8af0f0822797 ("ext4: fix readdir error in the case of inline_data+dir_index")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 7633b08b 21-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()

Clean up namespace pollution by the inline_data code.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# ddce3b94 21-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: refactor initialize_dirent_tail()

Move the calculation of the location of the dirent tail into
initialize_dirent_tail(). Also prefix the function with ext4_ to fix
kernel namepsace polution.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# f036adb3 21-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: rename "dirent_csum" functions to use "dirblock"

Functions such as ext4_dirent_csum_verify() and ext4_dirent_csum_set()
don't actually operate on a directory entry, but a directory block.
And while they take a struct ext4_dir_entry *dirent as an argument, it
had better be the first directory at the beginning of the direct
block, or things will go very wrong.

Rename the following functions so that things make more sense, and
remove a lot of confusing casts along the way:

ext4_dirent_csum_verify -> ext4_dirblock_csum_verify
ext4_dirent_csum_set -> ext4_dirblock_csum_set
ext4_dirent_csum -> ext4_dirblock_csum
ext4_handle_dirty_dirent_node -> ext4_handle_dirty_dirblock

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# b886ee3e 25-Apr-2019 Gabriel Krisman Bertazi <krisman@collabora.co.uk>

ext4: Support case-insensitive file name lookups

This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string. This operation is called a
case-insensitive file name lookup.

The feature is configured as an inode attribute applied to directories
and inherited by its children. This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.

* dcache handling:

For a +F directory, Ext4 only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().

d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.

DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware. The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.

* Dealing with invalid sequences:

By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file. This means that case-insensitive
file name lookup will not work only for that file. An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding. When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.

* Normalization algorithm:

The UTF-8 algorithms used to compare strings in ext4 is implemented
lives in fs/unicode, and is based on a previous version developed by
SGI. It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method for EXT4 because:

- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like
compatibility decompositions.

Although:

- This implementation is not completely linguistic accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 2b08b1f1 24-Dec-2018 Theodore Ts'o <tytso@mit.edu>

ext4: fix a potential fiemap/page fault deadlock w/ inline_data

The ext4_inline_data_fiemap() function calls fiemap_fill_next_extent()
while still holding the xattr semaphore. This is not necessary and it
triggers a circular lockdep warning. This is because
fiemap_fill_next_extent() could trigger a page fault when it writes
into page which triggers a page fault. If that page is mmaped from
the inline file in question, this could very well result in a
deadlock.

This problem can be reproduced using generic/519 with a file system
configuration which has the inline_data feature enabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

# 132d00be 03-Dec-2018 Maurizio Lombardi <mlombard@redhat.com>

ext4: missing unlock/put_page() in ext4_try_to_write_inline_data()

In case of error, ext4_try_to_write_inline_data() should unlock
and release the page it holds.

Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data")
Cc: stable@kernel.org # 3.8
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 625ef8a3 02-Oct-2018 Lukas Czerner <lczerner@redhat.com>

ext4: initialize retries variable in ext4_da_write_inline_data_begin()

Variable retries is not initialized in ext4_da_write_inline_data_begin()
which can lead to nondeterministic number of retries in case we hit
ENOSPC. Initialize retries to zero as we do everywhere else.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: bc0ca9df3b2a ("ext4: retry allocation when inline->extent conversion failed")
Cc: stable@kernel.org

# 4d982e25 27-Aug-2018 Theodore Ts'o <tytso@mit.edu>

ext4: avoid divide by zero fault when deleting corrupted inline directories

A specially crafted file system can trick empty_inline_dir() into
reading past the last valid entry in a inline directory, and then run
into the end of xattr marker. This will trigger a divide by zero
fault. Fix this by using the size of the inline directory instead of
dir->i_size.

Also clean up error reporting in __ext4_check_dir_entry so that the
message is clearer and more understandable --- and avoids the division
by zero trap if the size passed in is zero. (I'm not sure why we
coded it that way in the first place; printing offset % size is
actually more confusing and less useful.)

https://bugzilla.kernel.org/show_bug.cgi?id=200933

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org

# 362eca70 09-Jul-2018 Theodore Ts'o <tytso@mit.edu>

ext4: fix inline data updates with checksums enabled

The inline data code was updating the raw inode directly; this is
problematic since if metadata checksums are enabled,
ext4_mark_inode_dirty() must be called to update the inode's checksum.
In addition, the jbd2 layer requires that get_write_access() be called
before the metadata buffer is modified. Fix both of these problems.

https://bugzilla.kernel.org/show_bug.cgi?id=200443

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

# 70a2dc6a 08-Jul-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
"Bug fixes for ext4; most of which relate to vulnerabilities where a
maliciously crafted file system image can result in a kernel OOPS or
hang.

At least one fix addresses an inline data bug could be triggered by
userspace without the need of a crafted file system (although it does
require that the inline data feature be enabled)"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: check superblock mapped prior to committing
ext4: add more mount time checks of the superblock
ext4: add more inode number paranoia checks
ext4: avoid running out of journal credits when appending to an inline file
jbd2: don't mark block as modified if the handle is out of credits
ext4: never move the system.data xattr out of the inode body
ext4: clear i_data in ext4_inode_info when removing inline data
ext4: include the illegal physical block in the bad map ext4_error msg
ext4: verify the depth of extent tree in ext4_find_extent()
ext4: only look at the bg_flags field if it is valid
ext4: make sure bitmaps and the inode table don't overlap with bg descriptors
ext4: always check block group bounds in ext4_init_block_bitmap()
ext4: always verify the magic number in xattr blocks
ext4: add corruption check in ext4_xattr_set_entry()
ext4: add warn_on_error mount option


# 8bc1379b 16-Jun-2018 Theodore Ts'o <tytso@mit.edu>

ext4: avoid running out of journal credits when appending to an inline file

Use a separate journal transaction if it turns out that we need to
convert an inline file to use an data block. Otherwise we could end
up failing due to not having journal credits.

This addresses CVE-2018-10883.

https://bugzilla.kernel.org/show_bug.cgi?id=200071

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

# 6e8ab72a 14-Jun-2018 Theodore Ts'o <tytso@mit.edu>

ext4: clear i_data in ext4_inode_info when removing inline data

When converting from an inode from storing the data in-line to a data
block, ext4_destroy_inline_data_nolock() was only clearing the on-disk
copy of the i_blocks[] array. It was not clearing copy of the
i_blocks[] in ext4_inode_info, in i_data[], which is the copy actually
used by ext4_map_blocks().

This didn't matter much if we are using extents, since the extents
header would be invalid and thus the extents could would re-initialize
the extents tree. But if we are using indirect blocks, the previous
contents of the i_blocks array will be treated as block numbers, with
potentially catastrophic results to the file system integrity and/or
user data.

This gets worse if the file system is using a 1k block size and
s_first_data is zero, but even without this, the file system can get
quite badly corrupted.

This addresses CVE-2018-10881.

https://bugzilla.kernel.org/show_bug.cgi?id=200015

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

# 6567af78 05-Jun-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
"New features this cycle include the ability to relabel mounted
filesystems, support for fallocated swapfiles, and using FUA for pure
data O_DSYNC directio writes. With this cycle we begin to integrate
online filesystem repair and refactor the growfs code in preparation
for eventual subvolume support, though the road ahead for both
features is quite long.

There are also numerous refactorings of the iomap code to remove
unnecessary log overhead, to disentangle some of the quota code, and
to prepare for buffer head removal in a future upstream kernel.

Metadata validation continues to improve, both in the hot path
veifiers and the online filesystem check code. I anticipate sending a
second pull request in a few days with more metadata validation
improvements.

This series has been run through a full xfstests run over the weekend
and through a quick xfstests run against this morning's master, with
no major failures reported.

Summary:

- Strengthen inode number and structure validation when allocating
inodes.

- Reduce pointless buffer allocations during cache miss

- Use FUA for pure data O_DSYNC directio writes

- Various iomap refactorings

- Strengthen quota metadata verification to avoid unfixable broken
quota

- Make AGFL block freeing a deferred operation to avoid blowing out
transaction reservations when running complex operations

- Get rid of the log item descriptors to reduce log overhead

- Fix various reflink bugs where inodes were double-joined to
transactions

- Don't issue discards when trimming unwritten extents

- Refactor incore dquot initialization and retrieval interfaces

- Fix some locking problmes in the quota scrub code

- Strengthen btree structure checks in scrub code

- Rewrite swapfile activation to use iomap and support unwritten
extents

- Make scrub exit to userspace sooner when corruptions or
cross-referencing problems are found

- Make scrub invoke the data fork scrubber directly on metadata
inodes

- Don't do background reclamation of post-eof and cow blocks when the
fs is suspended

- Fix secondary superblock buffer lifespan hinting

- Refactor growfs to use table-dispatched functions instead of long
stringy functions

- Move growfs code to libxfs

- Implement online fs label getting and setting

- Introduce online filesystem repair (in a very limited capacity)

- Fix unit conversion problems in the realtime freemap iteration
functions

- Various refactorings and cleanups in preparation to remove buffer
heads in a future release

- Reimplement the old bmap call with iomap

- Remove direct buffer head accesses from seek hole/data

- Various bug fixes"

* tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
fs: use ->is_partially_uptodate in page_cache_seek_hole_data
fs: remove the buffer_unwritten check in page_seek_hole_data
fs: move page_cache_seek_hole_data to iomap.c
xfs: use iomap_bmap
iomap: add an iomap-based bmap implementation
iomap: add a iomap_sector helper
iomap: use __bio_add_page in iomap_dio_zero
iomap: move IOMAP_F_BOUNDARY to gfs2
iomap: fix the comment describing IOMAP_NOWAIT
iomap: inline data should be an iomap type, not a flag
mm: split ->readpages calls to avoid non-contiguous pages lists
mm: return an unsigned int from __do_page_cache_readahead
mm: give the 'ret' variable a better name __do_page_cache_readahead
block: add a lower-level bio_add_page interface
xfs: fix error handling in xfs_refcount_insert()
xfs: fix xfs_rtalloc_rec units
xfs: strengthen rtalloc query range checks
xfs: xfs_rtbuf_get should check the bmapi_read results
xfs: xfs_rtword_t should be unsigned, not signed
dax: change bdev_dax_supported() to support boolean returns
...


# 19319b53 01-Jun-2018 Christoph Hellwig <hch@lst.de>

iomap: inline data should be an iomap type, not a flag

Inline data is fundamentally different from our normal mapped case in that
it doesn't even have a block address. So instead of having a flag for it
it should be an entirely separate iomap range type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 117166ef 22-May-2018 Theodore Ts'o <tytso@mit.edu>

ext4: do not allow external inodes for inline data

The inline data feature was implemented before we added support for
external inodes for xattrs. It makes no sense to support that
combination, but the problem is that there are a number of extended
attribute checks that are skipped if e_value_inum is non-zero.

Unfortunately, the inline data code is completely e_value_inum
unaware, and attempts to interpret the xattr fields as if it were an
inline xattr --- at which point, Hilarty Ensues.

This addresses CVE-2018-11412.

https://bugzilla.kernel.org/show_bug.cgi?id=199803

Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
Cc: stable@kernel.org

# 6fbac201 07-Feb-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'iversion-v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull inode->i_version cleanup from Jeff Layton:
"Goffredo went ahead and sent a patch to rename this function, and
reverse its sense, as we discussed last week.

The patch is very straightforward and I figure it's probably best to
go ahead and merge this to get the API as settled as possible"

* tag 'iversion-v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
iversion: Rename make inode_cmp_iversion{+raw} to inode_eq_iversion{+raw}


# 23aedc4b 03-Feb-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
"Only miscellaneous cleanups and bug fixes for ext4 this cycle"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: create ext4_kset dynamically
ext4: create ext4_feat kobject dynamically
ext4: release kobject/kset even when init/register fail
ext4: fix incorrect indentation of if statement
ext4: correct documentation for grpid mount option
ext4: use 'sbi' instead of 'EXT4_SB(sb)'
ext4: save error to disk in __ext4_grp_locked_error()
jbd2: fix sphinx kernel-doc build warnings
ext4: fix a race in the ext4 shutdown path
mbcache: make sure c_entry_count is not decremented past zero
ext4: no need flush workqueue before destroying it
ext4: fixed alignment and minor code cleanup in ext4.h
ext4: fix ENOSPC handling in DAX page fault handler
dax: pass detailed error code from dax_iomap_fault()
mbcache: revert "fs/mbcache.c: make count_objects() more robust"
mbcache: initialize entry->e_referenced in mb_cache_entry_create()
ext4: fix up remaining files with SPDX cleanups


# c472c07b 01-Feb-2018 Goffredo Baroncelli <kreijack@inwind.it>

iversion: Rename make inode_cmp_iversion{+raw} to inode_eq_iversion{+raw}

The function inode_cmp_iversion{+raw} is counter-intuitive, because it
returns true when the counters are different and false when these are equal.

Rename it to inode_eq_iversion{+raw}, which will returns true when
the counters are equal and false otherwise.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Jeff Layton <jlayton@redhat.com>

# ee73f9a5 09-Jan-2018 Jeff Layton <jlayton@kernel.org>

ext4: convert to new i_version API

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>

# f5166768 17-Dec-2017 Theodore Ts'o <tytso@mit.edu>

ext4: fix up remaining files with SPDX cleanups

A number of ext4 source files were skipped due because their copyright
permission statements didn't match the expected text used by the
automated conversion utilities. I've added SPDX tags for the rest.

While looking at some of these files, I've noticed that we have quite
a bit of variation on the licenses that were used --- in particular
some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
and we have some files that have a LGPL-2.1 license (which was quite
surprising).

I've not attempted to do any license changes. Even if it is perfectly
legal to relicense to GPL 2.0-only for consistency's sake, that should
be done with ext4 developer community discussion.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 559db4c6 12-Oct-2017 Ross Zwisler <zwisler@kernel.org>

ext4: prevent data corruption with inline data + DAX

If an inode has inline data it is currently prevented from using DAX by a
check in ext4_set_inode_flags(). When the inode grows inline data via
ext4_create_inline_data() or removes its inline data via
ext4_destroy_inline_data_nolock(), the value of S_DAX can change.

Currently these changes are unsafe because we don't hold off page faults
and I/O, write back dirty radix tree entries and invalidate all mappings.
There are also issues with mm-level races when changing the value of S_DAX,
as well as issues with the VM_MIXEDMAP flag:

https://www.spinics.net/lists/linux-xfs/msg09859.html

The unsafe transition of S_DAX can reliably cause data corruption, as shown
by the following fstest:

https://patchwork.kernel.org/patch/9948381/

Fix this issue by preventing the DAX mount option from being used on
filesystems that were created to support inline data. Inline data is an
option given to mkfs.ext4.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org

# 7046ae35 01-Oct-2017 Andreas Gruenbacher <agruenba@redhat.com>

ext4: Add iomap support for inline data

Report inline data as a IOMAP_F_DATA_INLINE mapping. This allows to use
iomap_seek_hole and iomap_seek_data in ext4_llseek and makes switching
to iomap_fiemap in ext4_fiemap easier.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>

# e50e5129 21-Jun-2017 Andreas Dilger <andreas.dilger@intel.com>

ext4: xattr-in-inode support

Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.

If the size of an xattr value is larger than will fit in a single
external block, then the xattr value will be saved into the body
of an external xattr inode.

The also helps support a larger number of xattr, since only the headers
will be stored in the in-inode space or the single external block.

The inode is referenced from the xattr header via "e_value_inum",
which was formerly "e_value_block", but that field was never used.
The e_value_size still contains the xattr size so that listing
xattrs does not need to look up the inode if the data is not accessed.

struct ext4_xattr_entry {
__u8 e_name_len; /* length of name */
__u8 e_name_index; /* attribute name index */
__le16 e_value_offs; /* offset in disk block of value */
__le32 e_value_inum; /* inode in which value is stored */
__le32 e_value_size; /* size of attribute value */
__le32 e_hash; /* hash value of name and value */
char e_name[0]; /* attribute name */
};

The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
holds a back-reference to the owning inode in its i_mtime field,
allowing the ext4/e2fsck to verify the correct inode is accessed.

[ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ]

Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424
Signed-off-by: Kalpak Shah <kalpak.shah@sun.com>
Signed-off-by: James Simmons <uja.ornl@gmail.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

# d6b97550 24-May-2017 Eric Biggers <ebiggers@google.com>

ext4: remove unused d_name argument from ext4_search_dir() et al.

Now that we are passing a struct ext4_filename, we do not need to pass
around the original struct qstr too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>

# 1bc0af60 29-Apr-2017 Eric Biggers <ebiggers@google.com>

ext4: trim return value and 'dir' argument from ext4_insert_dentry()

In the initial implementation of ext4 encryption, the filename was
encrypted in ext4_insert_dentry(), which could fail and also required
access to the 'dir' inode. Since then ext4 filename encryption has been
changed to encrypt the filename earlier, so we can revert the additions
to ext4_insert_dentry().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# b9cf625d 15-Mar-2017 Eric Biggers <ebiggers@google.com>

ext4: mark inode dirty after converting inline directory

If ext4_convert_inline_data() was called on a directory with inline
data, the filesystem was left in an inconsistent state (as considered by
e2fsck) because the file size was not increased to cover the new block.
This happened because the inode was not marked dirty after i_disksize
was updated. Fix this by marking the inode dirty at the end of
ext4_finish_convert_inline_dir().

This bug was probably not noticed before because most users mark the
inode dirty afterwards for other reasons. But if userspace executed
FS_IOC_SET_ENCRYPTION_POLICY with invalid parameters, as exercised by
'kvm-xfstests -c adv generic/396', then the inode was never marked dirty
after updating i_disksize.

Cc: stable@vger.kernel.org # 3.10+
Fixes: 3c47d54170b6a678875566b1b8d6dcf57904e49b
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 0db1ff22 04-Feb-2017 Theodore Ts'o <tytso@mit.edu>

ext4: add shutdown bit and check for it

Add a shutdown bit that will cause ext4 processing to fail immediately
with EIO.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# eb5efbcb 04-Feb-2017 Theodore Ts'o <tytso@mit.edu>

ext4: fix inline data error paths

The write_end() function must always unlock the page and drop its ref
count, even on an error.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

# 01daf945 22-Jan-2017 Theodore Ts'o <tytso@mit.edu>

ext4: propagate error values from ext4_inline_data_truncate()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# b907f2d5 11-Jan-2017 Theodore Ts'o <tytso@mit.edu>

ext4: avoid calling ext4_mark_inode_dirty() under unneeded semaphores

There is no need to call ext4_mark_inode_dirty while holding xattr_sem
or i_data_sem, so where it's easy to avoid it, move it out from the
critical region.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# c755e251 11-Jan-2017 Theodore Ts'o <tytso@mit.edu>

ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()

The xattr_sem deadlock problems fixed in commit 2e81a4eeedca: "ext4:
avoid deadlock when expanding inode size" didn't include the use of
xattr_sem in fs/ext4/inline.c. With the addition of project quota
which added a new extra inode field, this exposed deadlocks in the
inline_data code similar to the ones fixed by 2e81a4eeedca.

The deadlock can be reproduced via:

dmesg -n 7
mke2fs -t ext4 -O inline_data -Fq -I 256 /dev/vdc 32768
mount -t ext4 -o debug_want_extra_isize=24 /dev/vdc /vdc
mkdir /vdc/a
umount /vdc
mount -t ext4 /dev/vdc /vdc
echo foo > /vdc/a/foo

and looks like this:

[ 11.158815]
[ 11.160276] =============================================
[ 11.161960] [ INFO: possible recursive locking detected ]
[ 11.161960] 4.10.0-rc3-00015-g011b30a8a3cf #160 Tainted: G W
[ 11.161960] ---------------------------------------------
[ 11.161960] bash/2519 is trying to acquire lock:
[ 11.161960] (&ei->xattr_sem){++++..}, at: [<c1225a4b>] ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960]
[ 11.161960] but task is already holding lock:
[ 11.161960] (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
[ 11.161960]
[ 11.161960] other info that might help us debug this:
[ 11.161960] Possible unsafe locking scenario:
[ 11.161960]
[ 11.161960] CPU0
[ 11.161960] ----
[ 11.161960] lock(&ei->xattr_sem);
[ 11.161960] lock(&ei->xattr_sem);
[ 11.161960]
[ 11.161960] *** DEADLOCK ***
[ 11.161960]
[ 11.161960] May be due to missing lock nesting notation
[ 11.161960]
[ 11.161960] 4 locks held by bash/2519:
[ 11.161960] #0: (sb_writers#3){.+.+.+}, at: [<c11a2414>] mnt_want_write+0x1e/0x3e
[ 11.161960] #1: (&type->i_mutex_dir_key){++++++}, at: [<c119508b>] path_openat+0x338/0x67a
[ 11.161960] #2: (jbd2_handle){++++..}, at: [<c123314a>] start_this_handle+0x582/0x622
[ 11.161960] #3: (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
[ 11.161960]
[ 11.161960] stack backtrace:
[ 11.161960] CPU: 0 PID: 2519 Comm: bash Tainted: G W 4.10.0-rc3-00015-g011b30a8a3cf #160
[ 11.161960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[ 11.161960] Call Trace:
[ 11.161960] dump_stack+0x72/0xa3
[ 11.161960] __lock_acquire+0xb7c/0xcb9
[ 11.161960] ? kvm_clock_read+0x1f/0x29
[ 11.161960] ? __lock_is_held+0x36/0x66
[ 11.161960] ? __lock_is_held+0x36/0x66
[ 11.161960] lock_acquire+0x106/0x18a
[ 11.161960] ? ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960] down_write+0x39/0x72
[ 11.161960] ? ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960] ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960] ? _raw_read_unlock+0x22/0x2c
[ 11.161960] ? jbd2_journal_extend+0x1e2/0x262
[ 11.161960] ? __ext4_journal_get_write_access+0x3d/0x60
[ 11.161960] ext4_mark_inode_dirty+0x17d/0x26d
[ 11.161960] ? ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
[ 11.161960] ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
[ 11.161960] ext4_try_add_inline_entry+0x69/0x152
[ 11.161960] ext4_add_entry+0xa3/0x848
[ 11.161960] ? __brelse+0x14/0x2f
[ 11.161960] ? _raw_spin_unlock_irqrestore+0x44/0x4f
[ 11.161960] ext4_add_nondir+0x17/0x5b
[ 11.161960] ext4_create+0xcf/0x133
[ 11.161960] ? ext4_mknod+0x12f/0x12f
[ 11.161960] lookup_open+0x39e/0x3fb
[ 11.161960] ? __wake_up+0x1a/0x40
[ 11.161960] ? lock_acquire+0x11e/0x18a
[ 11.161960] path_openat+0x35c/0x67a
[ 11.161960] ? sched_clock_cpu+0xd7/0xf2
[ 11.161960] do_filp_open+0x36/0x7c
[ 11.161960] ? _raw_spin_unlock+0x22/0x2c
[ 11.161960] ? __alloc_fd+0x169/0x173
[ 11.161960] do_sys_open+0x59/0xcc
[ 11.161960] SyS_open+0x1d/0x1f
[ 11.161960] do_int80_syscall_32+0x4f/0x61
[ 11.161960] entry_INT80_32+0x2f/0x2f
[ 11.161960] EIP: 0xb76ad469
[ 11.161960] EFLAGS: 00000286 CPU: 0
[ 11.161960] EAX: ffffffda EBX: 08168ac8 ECX: 00008241 EDX: 000001b6
[ 11.161960] ESI: b75e46bc EDI: b7755000 EBP: bfbdb108 ESP: bfbdafc0
[ 11.161960] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b

Cc: stable@vger.kernel.org # 3.10 (requires 2e81a4eeedca as a prereq)
Reported-by: George Spelvin <linux@sciencehorizons.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 578620f4 10-Dec-2016 Dan Carpenter <error27@gmail.com>

ext4: return -ENOMEM instead of success

We should set the error code if kzalloc() fails.

Fixes: 67cf5b09a46f ("ext4: add the basic function for inline data support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

# a3caa24b 20-Nov-2016 Jan Kara <jack@suse.cz>

ext4: only set S_DAX if DAX is really supported

Currently we have S_DAX set inode->i_flags for a regular file whenever
ext4 is mounted with dax mount option. However in some cases we cannot
really do DAX - e.g. when inode is marked to use data journalling, when
inode data is being encrypted, or when inode is stored inline. Make sure
S_DAX flag is appropriately set/cleared in these cases.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# eeca7ea1 14-Nov-2016 Deepa Dinamani <deepa.kernel@gmail.com>

ext4: use current_time() for inode timestamps

CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.

current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().

Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>

# a7550b30 10-Jul-2016 Jaegeuk Kim <jaegeuk@kernel.org>

ext4 crypto: migrate into vfs's crypto engine

This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 8d2ae1cb 26-Apr-2016 Jakub Wilk <jwilk@jwilk.net>

ext4: remove trailing \n from ext4_warning/ext4_error calls

Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>

# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# a8ed9b86 09-Mar-2016 Geliang Tang <geliangtang@163.com>

ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()

BUFFER_TRACE info "call ext4_handle_dirty_metadata" doesn't match the
code, so drop it.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 705965bd 08-Mar-2016 Jan Kara <jack@suse.cz>

ext4: rename and split get blocks functions

Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
describe what it does. Also split out get blocks functions for direct
IO. Later we move functionality from _ext4_get_blocks() there. There's no
functional change in this patch.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 56a04915 08-Jan-2016 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: simplify interfaces to directory entry insert functions

A number of functions include ext4_add_dx_entry, make_indexed_dir,
etc. are being passed a dentry even though the only thing they use is
the containing parent. We can shrink the code size slightly by making
this replacement. This will also be useful in cases where we don't
have a dentry as the argument to the directory entry insert functions.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# e2b911c5 17-Oct-2015 Darrick J. Wong <darrick.wong@oracle.com>

ext4: clean up feature test macros with predicate functions

Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros. Furthermore, clean out the
places where we open-coded feature tests.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 5b643f9c 18-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: optimize filename encryption

Encrypt the filename as soon it is passed in by the user. This avoids
our needing to encrypt the filename 2 or 3 times while in the process
of creating a filename.

Similarly, when looking up a directory entry, encrypt the filename
early, or if the encryption key is not available, base-64 decode the
file syystem so that the hash value and the last 16 bytes of the
encrypted filename is available in the new struct ext4_filename data
structure.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 9ec3a646 26-Apr-2015 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only


# 2b0143b5 17-Mar-2015 David Howells <dhowells@redhat.com>

VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 4bdfc873 11-Apr-2015 Michael Halcrow <mhalcrow@google.com>

ext4 crypto: insert encrypted filenames into a leaf directory block

Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 2f61830a 11-Apr-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames

For encrypted directories, we need to pass in a separate parameter for
the decrypted filename, since the directory entry contains the
encrypted filename.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 80cfb71e 02-Apr-2015 Rasmus Villemoes <linux@rasmusvillemoes.dk>

ext4: fix transposition typo in format string

According to C99, %*.s means the same as %*.0s, in other words, print as
many spaces as the field width argument says and effectively ignore the
string argument. That is certainly not what was meant here. The kernel's
printf implementation, however, treats it as if the . was not there,
i.e. as %*s. I don't know if de->name is nul-terminated or not, but in
any case I'm guessing the intention was to use de->name_len as precision
instead of field width.

[ Note: this is debugging code which is commented out, so this is not
security issue; a developer would have to explicitly enable
INLINE_DIR_DEBUG before this would be an issue. ]

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 50db71ab 05-Dec-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: ext4_da_convert_inline_data_to_extent drop locked page after error

Testcase:
xfstests generic/270
MKFS_OPTIONS="-q -I 256 -O inline_data,64bit"

Call Trace:
[<ffffffff81144c76>] lock_page+0x35/0x39 -------> DEADLOCK
[<ffffffff81145260>] pagecache_get_page+0x65/0x15a
[<ffffffff811507fc>] truncate_inode_pages_range+0x1db/0x45c
[<ffffffff8120ea63>] ? ext4_da_get_block_prep+0x439/0x4b6
[<ffffffff811b29b7>] ? __block_write_begin+0x284/0x29c
[<ffffffff8120e62a>] ? ext4_change_inode_journal_flag+0x16b/0x16b
[<ffffffff81150af0>] truncate_inode_pages+0x12/0x14
[<ffffffff81247cb4>] ext4_truncate_failed_write+0x19/0x25
[<ffffffff812488cf>] ext4_da_write_inline_data_begin+0x196/0x31c
[<ffffffff81210dad>] ext4_da_write_begin+0x189/0x302
[<ffffffff810c07ac>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff810ddd13>] ? read_seqcount_begin.clone.1+0x9f/0xcc
[<ffffffff8114309d>] generic_perform_write+0xc7/0x1c6
[<ffffffff810c040e>] ? mark_held_locks+0x59/0x77
[<ffffffff811445d1>] __generic_file_write_iter+0x17f/0x1c5
[<ffffffff8120726b>] ext4_file_write_iter+0x2a5/0x354
[<ffffffff81185656>] ? file_start_write+0x2a/0x2c
[<ffffffff8107bcdb>] ? bad_area_nosemaphore+0x13/0x15
[<ffffffff811858ce>] new_sync_write+0x8a/0xb2
[<ffffffff81186e7b>] vfs_write+0xb5/0x14d
[<ffffffff81186ffb>] SyS_write+0x5c/0x8c
[<ffffffff816f2529>] system_call_fastpath+0x12/0x17

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# d952d69e 02-Dec-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: ext4_inline_data_fiemap should respect callers argument

Currently ext4_inline_data_fiemap ignores requested arguments (start
and len) which may lead endless loop if start != 0. Also fix incorrect
extent length determination.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 5cc28a9e 02-Dec-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: prevent fsreentrance deadlock for inline_data

ext4_da_convert_inline_data_to_extent() invokes
grab_cache_page_write_begin(). grab_cache_page_write_begin performs
memory allocation, so fs-reentrance should be prohibited because we
are inside journal transaction.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 9aa5d32b 13-Oct-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: Replace open coded mdata csum feature to helper function

Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized csum machinery.

#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END

# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)

https://bugzilla.kernel.org/show_bug.cgi?id=82201

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

# 684de574 11-Sep-2014 Darrick J. Wong <darrick.wong@oracle.com>

ext4: don't keep using page if inline conversion fails

If inline->extent conversion fails (most probably due to ENOSPC) and
we release the temporary page that we allocated to transfer the file
contents, don't keep using the page pointer after releasing the page.
This occasionally leads to complaints about evicting locked pages or
hangs when blocksize > pagesize, because it's possible for the page to
get reallocated elsewhere in the meantime.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tao Ma <tm@tao.ma>

# 40b163f1 28-Jul-2014 Darrick J. Wong <darrick.wong@oracle.com>

ext4: check inline directory before converting

Before converting an inline directory to a regular directory, check
the directory entries to make sure they're not obviously broken.
This helps us to avoid a BUG_ON if one of the dirents is trashed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>

# 83447ccb 15-Jul-2014 Zheng Liu <wenqing.lz@taobao.com>

ext4: make ext4_has_inline_data() as a inline function

Now ext4_has_inline_data() is used in wide spread codepaths. So we need
to make it as a inline function to avoid burning some CPU cycles.

Change in text size:

text data bss dec hex filename
before: 326110 19258 5528 350896 55ab0 fs/ext4/ext4.o
after: 326227 19258 5528 351013 55b25 fs/ext4/ext4.o

I use the following script to measure the CPU usage.

#!/bin/bash

shm_base='/dev/shm'
img=${shm_base}/ext4-img
mnt=/mnt/loop

e2fsprgs_base=$HOME/e2fsprogs
mkfs=${e2fsprgs_base}/misc/mke2fs
fsck=${e2fsprgs_base}/e2fsck/e2fsck

sudo umount $mnt
dd if=/dev/zero of=$img bs=4k count=3145728
${mkfs} -t ext4 -O inline_data -F $img
sudo mount -t ext4 -o loop $img $mnt

# start testing...
testdir="${mnt}/testdir"
mkdir $testdir
cd $testdir

echo "start testing..."
for ((cnt=0;cnt<100;cnt++)); do

for ((i=0;i<5;i++)); do
for ((j=0;j<5;j++)); do
for ((k=0;k<5;k++)); do
for ((l=0;l<5;l++)); do
mkdir -p $i/$j/$k/$l
echo "$i-$j-$k-$l" > $i/$j/$k/$l/testfile
done
done
done
done

ls -R $testdir > /dev/null
rm -rf $testdir/*

done

The result of `perf top -G -U` is as below.

vanilla:
13.92% [ext4] [k] ext4_do_update_inode
9.36% [ext4] [k] __ext4_get_inode_loc
4.07% [ext4] [k] ftrace_define_fields_ext4_writepages
3.83% [ext4] [k] __ext4_handle_dirty_metadata
3.42% [ext4] [k] ext4_get_inode_flags
2.71% [ext4] [k] ext4_mark_iloc_dirty
2.46% [ext4] [k] ftrace_define_fields_ext4_direct_IO_enter
2.26% [ext4] [k] ext4_get_inode_loc
2.22% [ext4] [k] ext4_has_inline_data
[...]

After applied the patch, we don't see ext4_has_inline_data() because it
has been inlined and perf couldn't sample it. Although it doesn't mean
that the CPU cycles can be saved but at least the overhead of function
calls can be eliminated. So IMHO we'd better inline this function.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 5d601255 12-May-2014 liang xie <xieliang007@gmail.com>

ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access

Make them more consistently

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# c197855e 12-May-2014 Stephen Hemminger <stephen@networkplumber.org>

ext4: make local functions static

I have been running make namespacecheck to look for unneeded globals, and
found these in ext4.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# d7092ae2 11-Jan-2014 jon ernst <jonernst07@gmail.com>

ext4: delete "set but not used" variables

Signed-off-by: Jon Ernst <jonernst07@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

# 09c455aa 06-Jan-2014 Theodore Ts'o <tytso@mit.edu>

ext4: avoid clearing beyond i_blocks when truncating an inline data file

A missing cast means that when we are truncating a file which is less
than 60 bytes, we don't clear the correct area of memory, and in fact
we can end up truncating the next inode in the inode table, or worse
yet, some other kernel data structure.

Addresses-Coverity-Id: #751987

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

# 52e44777 06-Jan-2014 Jan Kara <jack@suse.cz>

ext4: standardize error handling in ext4_da_write_inline_data_begin()

The function has a bit non-standard (for ext4) error recovery in that it
used a mix of 'out' labels and testing for 'handle' being NULL. There
isn't a good reason for that in the function so clean it up a bit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# bc0ca9df 06-Jan-2014 Jan Kara <jack@suse.cz>

ext4: retry allocation when inline->extent conversion failed

Similarly as other ->write_begin functions in ext4, also
ext4_da_write_inline_data_begin() should retry allocation if the
conversion failed because of ENOSPC. This avoids returning ENOSPC
prematurely because of uncommitted block deletions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 5ba052fe 30-Oct-2013 Azat Khuzhin <a3at.mail@gmail.com>

ext4: drop set but otherwise unused variable from ext4_add_dirent_to_inline()

Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 48ffdab1 30-Oct-2013 BoxiLiu <lewis.liulei@huawei.com>

ext4: change ext4_read_inline_dir() to return 0 on success

In ext4_read_inline_dir(), if there is inline data, the successful
return value is the return value of ext4_read_inline_data(). Howewer,
this is used by ext4_readdir(), and while it seems harmless to return
a positive value on success, it's inconsistent, since historically
we've always return 0 on success.

Signed-off-by: BoxiLiu <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Tao Ma <boyu.mt@taobao.com>

# 9e239bb9 02-Jul-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
"Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)

In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.

In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
ext4: optimize starting extent in ext4_ext_rm_leaf()
jbd2: invalidate handle if jbd2_journal_restart() fails
ext4: translate flag bits to strings in tracepoints
ext4: fix up error handling for mpage_map_and_submit_extent()
jbd2: fix theoretical race in jbd2__journal_restart
ext4: only zero partial blocks in ext4_zero_partial_blocks()
ext4: check error return from ext4_write_inline_data_end()
ext4: delete unnecessary C statements
ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
jbd2: move superblock checksum calculation to jbd2_write_superblock()
ext4: pass inode pointer instead of file pointer to punch hole
ext4: improve free space calculation for inline_data
ext4: reduce object size when !CONFIG_PRINTK
ext4: improve extent cache shrink mechanism to avoid to burn CPU time
ext4: implement error handling of ext4_mb_new_preallocation()
ext4: fix corruption when online resizing a fs with 1K block size
ext4: delete unused variables
ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
jbd2: remove debug dependency on debug_fs and update Kconfig help text
jbd2: use a single printk for jbd_debug()
...


# c4932dbe 01-Jul-2013 boxi liu <boxi10liu@gmail.com>

ext4: improve free space calculation for inline_data

In ext4 feature inline_data,it use the xattr's space to store the
inline data in inode.When we calculate the inline data as the xattr,we
add the pad.But in get_max_inline_xattr_value_size() function we count
the free space without pad.It cause some contents are moved to a block
even if it can be
stored in the inode.

Signed-off-by: liulei <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>

# 725bebb2 17-May-2013 Al Viro <viro@zeniv.linux.org.uk>

[readdir] convert ext4

and trim the living hell out bogosities in inline dir case

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# eaf37937 31-May-2013 Jan Kara <jack@suse.cz>

ext4: fix data offset overflow on 32-bit archs in ext4_inline_data_fiemap()

On 32-bit archs when sector_t is defined as 32-bit the logic computing
data offset in ext4_inline_data_fiemap(). Fix that by properly typing
the shifted value.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# c4d8b023 19-Apr-2013 Tao Ma <boyu.mt@taobao.com>

ext4: fix readdir error in case inline_data+^dir_index.

Zach reported a problem that if inline data is enabled, we don't
tell the difference between the offset of '.' and '..'. And a
getdents will fail if the user only want to get '.'. And what's
worse, we may meet with duplicate dir entries as the offset
for inline dir and non-inline one is quite different.

This patch just try to resolve this problem if dir_index
is disabled. In this case, f_pos is the real offset with
the dir block, so for inline dir, we just pretend as if
we are a dir block and returns the offset like a norml
dir block does.

Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 8af0f082 19-Apr-2013 Tao Ma <boyu.mt@taobao.com>

ext4: fix readdir error in the case of inline_data+dir_index

Zach reported a problem that if inline data is enabled, we don't
tell the difference between the offset of '.' and '..'. And a
getdents will fail if the user only want to get '.' and what's worse,
if there is a conversion happens when the user calls getdents
many times, he/she may get the same entry twice.

In theory, a dir block would also fail if it is converted to a
hashed-index based dir since f_pos will become a hash value, not the
real one, but it doesn't happen. And a deep investigation shows that
we uses a hash based solution even for a normal dir if the dir_index
feature is enabled.

So this patch just adds a new htree_inlinedir_to_tree for inline dir,
and if we find that the hash index is supported, we will do like what
we do for a dir block.

Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# d895cb1a 26-Feb-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.

The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.

Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.

PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...


# 496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 9924a92a 08-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: pass context information to jbd2__journal_start()

So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

T=/sys/kernel/debug/tracing
EVENT=$T/events/jbd2/jbd2_handle_stats
echo "interval > 5" > $EVENT/filter
echo 1 > $EVENT/enable

./run-my-fs-benchmark

cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms. Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation. Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 860d21e2 12-Jan-2013 Theodore Ts'o <tytso@mit.edu>

ext4: return ENOMEM if sb_getblk() fails

The only reason for sb_getblk() failing is if it can't allocate the
buffer_head. So ENOMEM is more appropriate than EIO. In addition,
make sure that the file system is marked as being inconsistent if
sb_getblk() fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

# bd9926e8 11-Dec-2012 Theodore Ts'o <tytso@mit.edu>

ext4: zero out inline data using memset() instead of empty_zero_page

Not all architectures (in particular, sparc64) have empty_zero_page.
So instead of copying from empty_zero_page, use memset to clear the
inline data by signalling to ext4_xattr_set_entry() via a magic
pointer value, EXT4_ZERO_ATTR_VALUE, which is defined by casting -1 to
a pointer.

This fixes a build failure on sparc64, and the memset() should be more
efficient than using memcpy() anyway.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 0c8d414f 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let fallocate handle inline data correctly

If we are punching hole in a file, we will return ENOTSUPP.
As for the fallocation of some extents, we will convert the
inline data to a normal extent based file first.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# aef1c851 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_truncate handle inline data correctly

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 0d812f77 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: evict inline data out if we need to strore xattr in inode

Now we that store data in the inode, in case we need to store some
xattrs and inode doesn't have enough space, Andreas suggested that we
should keep the xattr(metadata) in and data should be pushed out. So
this patch does the work.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 94191985 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let fiemap work with inline data

fiemap is used to find the disk layout of a file, as for inline data,
let us just pretend like a file with just one extent.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 32f7f22c 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_rename handle inline dir

In case we rename a directory, ext4_rename has to read the dir block
and change its dotdot's information. The old ext4_rename encapsulated
the dir_block read into itself. So this patch adds a new function
ext4_get_first_dir_block() which gets the dir buffer information so
the ext4_rename can handle it properly. As it will also change the
parent inode number, we return the parent_de so that ext4_rename() can
handle it more easily.

ext4_find_entry is also changed so that the caller(rename) can tell
whether the found entry is an inlined one or not and journaling the
corresponding buffer head.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 61f86638 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let empty_dir handle inline dir

empty_dir is used when deleting a dir. So it should handle inline dir
properly.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 9f40fe54 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_delete_entry() handle inline data

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# e8e948e7 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_find_entry handle inline data

Create a new function ext4_find_inline_entry() to handle the case of
inline data.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 65d165d9 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_readdir handle inline data

For "." and "..", we just call filldir by ourselves
instead of iterating the real dir entry.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 3c47d541 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let add_dir_entry handle inline data properly

This patch let add_dir_entry handle the inline data case. So the
dir is initialized as inline dir first and then we can try to add
some files to it, when the inline space can't hold all the entries,
a dir block will be created and the dir entry will be moved to it.

Also for an inlined dir, "." and ".." are removed and we only use
4 bytes to store the parent inode number. These 2 entries will be
added when we convert an inline dir to a block-based one.

[ Folded in patch from Dan Carpenter to remove an unused variable. ]

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 9c3569b5 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add delalloc support for inline data

For delayed allocation mode, we write to inline data if the file
is small enough. And in case of we write to some offset larger
than the inline size, the 1st page is dirtied, so that
ext4_da_writepages can handle the conversion. When the 1st page
is initialized with blocks, the inline part is removed.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 3fdcfb66 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add journalled write support for inline data

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# f19d5870 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add normal write support for inline data

For a normal write case (not journalled write, not delayed
allocation), we write to the inline if the file is small and convert
it to an extent based file when the write is larger than the max
inline size.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 46c7f254 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add read support for inline data

Let readpage and readpages handle the case when we want to read an
inlined file.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 67cf5b09 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add the basic function for inline data support

Implement inline data with xattr.

Now we use "system.data" to store xattr, and the xattr will
be extended if the i_size is increased while we don't release
the space during truncate.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

# 6b90d413 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_write_inline_data_end() to use a folio

Convert the incoming page to a folio so that we call compound_head()
only once instead of seven times.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230324180129.1220691-16-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6b87fbe4 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_read_inline_page() to ext4_read_inline_folio()

All callers now have a folio, so pass it and use it. The folio may
be large, although I doubt we'll want to use a large folio for an
inline file.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230324180129.1220691-15-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9a9d01f0 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_da_write_inline_data_begin() to use a folio

Saves a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230324180129.1220691-14-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4ed9b598 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_da_convert_inline_data_to_extent() to use a folio

Saves a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230324180129.1220691-13-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f8f8c89f 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_try_to_write_inline_data() to use a folio

Saves a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230324180129.1220691-12-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 83eba701 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_convert_inline_data_to_extent() to use a folio

Saves a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230324180129.1220691-11-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3edde93e 24-Mar-2023 Matthew Wilcox <willy@infradead.org>

ext4: Convert ext4_readpage_inline() to take a folio

Use the folio API in this function, saves a few calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230324180129.1220691-10-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1dcdce59 06-Mar-2023 Ye Bin <yebin10@huawei.com>

ext4: move where set the MAY_INLINE_DATA flag is set

The only caller of ext4_find_inline_data_nolock() that needs setting of
EXT4_STATE_MAY_INLINE_DATA flag is ext4_iget_extra_inode(). In
ext4_write_inline_data_end() we just need to update inode->i_inline_off.
Since we are going to add one more caller that does not need to set
EXT4_STATE_MAY_INLINE_DATA, just move setting of EXT4_STATE_MAY_INLINE_DATA
out to ext4_iget_extra_inode().

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230307015253.2232062-2-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 66267814 16-Aug-2022 Jiangshan Yi <yijiangshan@kylinos.cn>

fs/ext4: replace ternary operator with min()/max() and min_t()

Fix the following coccicheck warning:

fs/ext4/inline.c:183: WARNING opportunity for min().
fs/ext4/extents.c:2631: WARNING opportunity for max().
fs/ext4/extents.c:2632: WARNING opportunity for min().
fs/ext4/extents.c:5559: WARNING opportunity for max().
fs/ext4/super.c:6908: WARNING opportunity for min().

min()/max() and min_t() macro is defined in include/linux/minmax.h.
It avoids multiple evaluations of the arguments when non-constant and
performs strict type-checking.

Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220817025928.612851-1-13667453960@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c9fd167d 15-Jun-2022 Baokun Li <libaokun1@huawei.com>

ext4: correct max_inline_xattr_value_size computing

If the ext4 inode does not have xattr space, 0 is returned in the
get_max_inline_xattr_value_size function. Otherwise, the function returns
a negative value when the inode does not contain EXT4_STATE_XATTR.

Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5a57bca9 30-Jun-2022 Zhang Yi <yi.zhang@huawei.com>

ext4: fix reading leftover inlined symlinks

Since commit 6493792d3299 ("ext4: convert symlink external data block
mapping to bdev"), create new symlink with inline_data is not supported,
but it missing to handle the leftover inlined symlinks, which could
cause below error message and fail to read symlink.

ls: cannot read symbolic link 'foo': Structure needs cleaning

EXT4-fs error (device sda): ext4_map_blocks:605: inode #12: block
2021161080: comm ls: lblock 0 mapped to illegal pblock 2021161080
(length 1)

Fix this regression by adding ext4_read_inline_link(), which read the
inline data directly and convert it through a kmalloced buffer.

Fixes: 6493792d3299 ("ext4: convert symlink external data block mapping to bdev")
Cc: stable@kernel.org
Reported-by: Torge Matthies <openglfreak@googlemail.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Torge Matthies <openglfreak@googlemail.com>
Link: https://lore.kernel.org/r/20220630090100.2769490-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ef09ed5d 16-May-2022 Ye Bin <yebin10@huawei.com>

ext4: fix bug_on in ext4_writepages

we got issue as follows:
EXT4-fs error (device loop0): ext4_mb_generate_buddy:1141: group 0, block bitmap and bg descriptor inconsistent: 25 vs 31513 free cls
------------[ cut here ]------------
kernel BUG at fs/ext4/inode.c:2708!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 2 PID: 2147 Comm: rep Not tainted 5.18.0-rc2-next-20220413+ #155
RIP: 0010:ext4_writepages+0x1977/0x1c10
RSP: 0018:ffff88811d3e7880 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88811c098000
RDX: 0000000000000000 RSI: ffff88811c098000 RDI: 0000000000000002
RBP: ffff888128140f50 R08: ffffffffb1ff6387 R09: 0000000000000000
R10: 0000000000000007 R11: ffffed10250281ea R12: 0000000000000001
R13: 00000000000000a4 R14: ffff88811d3e7bb8 R15: ffff888128141028
FS: 00007f443aed9740(0000) GS:ffff8883aef00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020007200 CR3: 000000011c2a4000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
do_writepages+0x130/0x3a0
filemap_fdatawrite_wbc+0x83/0xa0
filemap_flush+0xab/0xe0
ext4_alloc_da_blocks+0x51/0x120
__ext4_ioctl+0x1534/0x3210
__x64_sys_ioctl+0x12c/0x170
do_syscall_64+0x3b/0x90

It may happen as follows:
1. write inline_data inode
vfs_write
new_sync_write
ext4_file_write_iter
ext4_buffered_write_iter
generic_perform_write
ext4_da_write_begin
ext4_da_write_inline_data_begin -> If inline data size too
small will allocate block to write, then mapping will has
dirty page
ext4_da_convert_inline_data_to_extent ->clear EXT4_STATE_MAY_INLINE_DATA
2. fallocate
do_vfs_ioctl
ioctl_preallocate
vfs_fallocate
ext4_fallocate
ext4_convert_inline_data
ext4_convert_inline_data_nolock
ext4_map_blocks -> fail will goto restore data
ext4_restore_inline_data
ext4_create_inline_data
ext4_write_inline_data
ext4_set_inode_state -> set inode EXT4_STATE_MAY_INLINE_DATA
3. writepages
__ext4_ioctl
ext4_alloc_da_blocks
filemap_flush
filemap_fdatawrite_wbc
do_writepages
ext4_writepages
if (ext4_has_inline_data(inode))
BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))

The root cause of this issue is we destory inline data until call
ext4_writepages under delay allocation mode. But there maybe already
convert from inline to extent. To solve this issue, we call
filemap_flush first..

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220516122634.1690462-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c30365b9 01-Apr-2022 Yu Zhe <yuzhe@nfschina.com>

ext4: remove unnecessary type castings

remove unnecessary void* type castings.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Link: https://lore.kernel.org/r/20220401081321.73735-1-yuzhe@nfschina.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b7446e7c 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove aop flags parameter from grab_cache_page_write_begin()

There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 832ee62d 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

ext4: Use scoped memory APIs in ext4_write_begin()

Instead of setting AOP_FLAG_NOFS, use memalloc_nofs_save() and
memalloc_nofs_restore() to prevent GFP_FS allocations recursing
into the filesystem with a journal already started.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>


# 36d116e9 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

ext4: Use scoped memory APIs in ext4_da_write_begin()

Instead of setting AOP_FLAG_NOFS, use memalloc_nofs_save() and
memalloc_nofs_restore() to prevent GFP_FS allocations recursing
into the filesystem with a journal already started.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>


# 7333ed35 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

ext4: Allow GFP_FS allocations in ext4_da_convert_inline_data_to_extent()

Since commit 8bc1379b82b8, the transaction is stopped before calling
ext4_da_convert_inline_data_to_extent(), which means we can do GFP_FS
allocations and recurse into the filesystem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>


# 7aab5c84 27-Feb-2022 Ye Bin <yebin10@huawei.com>

ext4: fix fs corruption when tring to remove a non-empty directory with IO error

We inject IO error when rmdir non empty direcory, then got issue as follows:
step1: mkfs.ext4 -F /dev/sda
step2: mount /dev/sda test
step3: cd test
step4: mkdir -p 1/2
step5: rmdir 1
[ 110.920551] ext4_empty_dir: inject fault
[ 110.921926] EXT4-fs warning (device sda): ext4_rmdir:3113: inode #12:
comm rmdir: empty directory '1' has too many links (3)
step6: cd ..
step7: umount test
step8: fsck.ext4 -f /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '..' in .../??? (13) has deleted/unused inode 12. Clear<y>? yes
Pass 3: Checking directory connectivity
Unconnected directory inode 13 (...)
Connect to /lost+found<y>? yes
Pass 4: Checking reference counts
Inode 13 ref count is 3, should be 2. Fix<y>? yes
Pass 5: Checking group summary information

/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 12/131072 files (0.0% non-contiguous), 26157/524288 blocks

ext4_rmdir
if (!ext4_empty_dir(inode))
goto end_rmdir;
ext4_empty_dir
bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE);
if (IS_ERR(bh))
return true;
Now if read directory block failed, 'ext4_empty_dir' will return true, assume
directory is empty. Obviously, it will lead to above issue.
To solve this issue, if read directory block failed 'ext4_empty_dir' just
return false. To avoid making things worse when file system is already
corrupted, 'ext4_empty_dir' also return false.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20220228024815.3952506-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 09355d9d 17-Jan-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: remove redundant max inline_size check in ext4_da_write_inline_data_begin()

ext4_prepare_inline_data() already checks for ext4_get_max_inline_size()
and returns -ENOSPC. So there is no need to check it twice within
ext4_da_write_inline_data_begin(). This patch removes the extra check.

It also makes it more clean.

No functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/cdd1654128d5105550c65fd13ca5da53b2162cc4.1642416995.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 897026aa 17-Jan-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: fix error handling in ext4_restore_inline_data()

While running "./check -I 200 generic/475" it sometimes gives below
kernel BUG(). Ideally we should not call ext4_write_inline_data() if
ext4_create_inline_data() has failed.

<log snip>
[73131.453234] kernel BUG at fs/ext4/inline.c:223!

<code snip>
212 static void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
213 void *buffer, loff_t pos, unsigned int len)
214 {
<...>
223 BUG_ON(!EXT4_I(inode)->i_inline_off);
224 BUG_ON(pos + len > EXT4_I(inode)->i_inline_size);

This patch handles the error and prints out a emergency msg saying potential
data loss for the given inode (since we couldn't restore the original
inline_data due to some previous error).

[ 9571.070313] EXT4-fs (dm-0): error restoring inline_data for inode -- potential data loss! (inode 1703982, error -30)

Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/9f4cd7dfd54fa58ff27270881823d94ddf78dd07.1642416995.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 4034247a 14-Jan-2022 NeilBrown <neilb@suse.de>

mm: introduce memalloc_retry_wait()

Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:

- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.

Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.

It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.

This patch introduces memalloc_retry_wait() with takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.

For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffie ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops uses.

linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.

Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0add491d 19-Aug-2021 Eric Whitney <enwlinux@gmail.com>

ext4: remove extent cache entries when truncating inline data

Conditionally remove all cached extents belonging to an inode
when truncating its inline data. It's only necessary to attempt to
remove cached extents when a conversion from inline to extent storage
has been initiated (!EXT4_STATE_MAY_INLINE_DATA). This avoids
unnecessary es lock overhead in the more common inline case.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210819144927.25163-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6984aef5 16-Jul-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: factor out write end code of inline file

Now that the inline_data file write end procedure are falled into the
common write end functions, it is not clear. Factor them out and do
some cleanup. This patch also drop ext4_da_write_inline_data_end()
and switch to use ext4_write_inline_data_end() instead because we also
need to do the same error processing if we failed to write data into
inline entry.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com


# 55ce2f64 16-Jul-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: correct the error path of ext4_write_inline_data_end()

Current error path of ext4_write_inline_data_end() is not correct.

Firstly, it should pass out the error value if ext4_get_inode_loc()
return fail, or else it could trigger infinite loop if we inject error
here. And then it's better to add inode to orphan list if it return fail
in ext4_journal_stop(), otherwise we could not restore inline xattr
entry after power failure. Finally, we need to reset the 'ret' value if
ext4_write_inline_data_end() return success in ext4_write_end() and
ext4_journalled_write_end(), otherwise we could not get the error return
value of ext4_journal_stop().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-3-yi.zhang@huawei.com


# 188c299e 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Support for checksumming from journal triggers

JBD2 layer support triggers which are called when journaling layer moves
buffer to a certain state. We can use the frozen trigger, which gets
called when buffer data is frozen and about to be written out to the
journal, to compute block checksums for some buffer types (similarly as
does ocfs2). This avoids unnecessary repeated recomputation of the
checksum (at the cost of larger window where memory corruption won't be
caught by checksumming) and is even necessary when there are
unsynchronized updaters of the checksummed data.

So add superblock and journal trigger type arguments to
ext4_journal_get_write_access() and ext4_journal_get_create_access() so
that frozen triggers can be set accordingly. Also add inode argument to
ext4_walk_page_buffers() and all the callbacks used with that function
for the same purpose. This patch is mostly only a change of prototype of
the above mentioned functions and a few small helpers. Real checksumming
will come later.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a54c4613 20-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: fix race writing to an inline_data file while its xattrs are changing

The location of the system.data extended attribute can change whenever
xattr_sem is not taken. So we need to recalculate the i_inline_off
field since it mgiht have changed between ext4_write_begin() and
ext4_write_end().

This means that caching i_inline_off is probably not helpful, so in
the long run we should probably get rid of it and shrink the in-memory
ext4 inode slightly, but let's fix the race the simple way for now.

Cc: stable@kernel.org
Fixes: f19d5870cbf72 ("ext4: add normal write support for inline data")
Reported-by: syzbot+13146364637c7363a7de@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 310c097c 02-Jun-2021 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()

ext4_xattr_ibody_inline_set() & ext4_xattr_ibody_set() have the exact
same definition. Hence remove ext4_xattr_ibody_inline_set() and all
its call references. Convert the callers of it to call
ext4_xattr_ibody_set() instead.

[ Modified to preserve ext4_xattr_ibody_set() and remove
ext4_xattr_ibody_inline_set() instead. -- TYT ]

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/fd566b799bbbbe9b668eb5eecde5b5e319e3694f.1622685482.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3088e5a5 27-Mar-2021 Bhaskar Chowdhury <unixbhaskar@gmail.com>

ext4: fix various seppling typos

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Link: https://lore.kernel.org/r/cover.1616840203.git.unixbhaskar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 471fbbea 19-Mar-2021 Daniel Rosenberg <drosen@google.com>

ext4: handle casefolding with encryption

This adds support for encryption with casefolding.

Since the name on disk is case preserving, and also encrypted, we can no
longer just recompute the hash on the fly. Additionally, to avoid
leaking extra information from the hash of the unencrypted name, we use
siphash via an fscrypt v2 policy.

The hash is stored at the end of the directory entry for all entries
inside of an encrypted and casefolded directory apart from those that
deal with '.' and '..'. This way, the change is backwards compatible
with existing ext4 filesystems.

[ Changed to advertise this feature via the file:
/sys/fs/ext4/features/encrypted_casefold -- TYT ]

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210319073414.1381041-2-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7067b261 02-Nov-2020 Joseph Qi <joseph.qi@linux.alibaba.com>

ext4: unlock xattr_sem properly in ext4_inline_data_truncate()

It takes xattr_sem to check inline data again but without unlock it
in case not have. So unlock it before return.

Fixes: aef1c8513c1f ("ext4: let ext4_truncate handle inline data correctly")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604370542-124630-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# b483bb77 04-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

ext4: delete duplicated words + other fixes

Delete repeated words in fs/ext4/.
{the, this, of, we, after}

Also change spelling of "xttr" in inline.c to "xattr" in 2 places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7ca4fcba 24-Apr-2020 kyoungho koo <rnrudgh@gmail.com>

ext4: Fix comment typo "the the".

I have found double typed comments "the the". So i modified it to
one "the"

Signed-off-by: kyoungho koo <rnrudgh@gmail.com>
Link: https://lore.kernel.org/r/20200424171620.GA11943@koo-Z370-HD3
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2fe34d29 10-Aug-2020 Kyoungho Koo <rnrudgh@gmail.com>

ext4: remove unused parameter of ext4_generic_delete_entry function

The ext4_generic_delete_entry function does not use the parameter
handle, so it can be removed.

Signed-off-by: Kyoungho Koo <rnrudgh@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200810080701.GA14160@koo-Z370-HD3
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4209ae12 26-Apr-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: handle ext4_mark_inode_dirty errors

ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
value may lead ext4 to ignore real failures that would result in
corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
as soon as possible and return errors to the caller whenever
appropriate.

One of the possible scnearios when this bug could affected is that
while creating a new inode, its directory entry gets added
successfully but while writing the inode itself mark_inode_dirty
returns error which is ignored. This would result in inconsistency
that the directory entry points to a non-existent inode.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 54d3adbc 28-Mar-2020 Theodore Ts'o <tytso@mit.edu>

ext4: save all error info in save_error_info() and drop ext4_set_errno()

Using a separate function, ext4_set_errno() to set the errno is
problematic because it doesn't do the right thing once
s_last_error_errorcode is non-zero. It's also less racy to set all of
the error information all at once. (Also, as a bonus, it shrinks code
size slightly.)

Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
Fixes: 878520ac45f9 ("ext4: save the error code which triggered...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d3b6f23f 28-Feb-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: move ext4_fiemap to use iomap framework

This patch moves ext4_fiemap to use iomap framework.
For xattr a new 'ext4_iomap_xattr_ops' is added.

Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/b9f45c885814fcdd0631747ff0fe08886270828c.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8d6ce136 22-Jan-2020 Shijie Luo <luoshijie1@huawei.com>

ext4,jbd2: fix comment and code style

Fix comment and remove unneccessary blank.

Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200123064325.36358-1-luoshijie1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 878520ac 19-Nov-2019 Theodore Ts'o <tytso@mit.edu>

ext4: save the error code which triggered an ext4_error() in the superblock

This allows the cause of an ext4_error() report to be categorized
based on whether it was triggered due to an I/O error, or an memory
allocation error, or other possible causes. Most errors are caused by
a detected file system inconsistency, so the default code stored in
the superblock will be EXT4_ERR_EFSCORRUPTED.

Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7a14826e 12-Aug-2019 Colin Ian King <colin.king@canonical.com>

ext4: set error return correctly when ext4_htree_store_dirent fails

Currently when the call to ext4_htree_store_dirent fails the error return
variable 'ret' is is not being set to the error code and variable count is
instead, hence the error code is not being returned. Fix this by assigning
ret to the error return code.

Addresses-Coverity: ("Unused value")
Fixes: 8af0f0822797 ("ext4: fix readdir error in the case of inline_data+dir_index")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7633b08b 21-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()

Clean up namespace pollution by the inline_data code.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ddce3b94 21-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: refactor initialize_dirent_tail()

Move the calculation of the location of the dirent tail into
initialize_dirent_tail(). Also prefix the function with ext4_ to fix
kernel namepsace polution.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f036adb3 21-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: rename "dirent_csum" functions to use "dirblock"

Functions such as ext4_dirent_csum_verify() and ext4_dirent_csum_set()
don't actually operate on a directory entry, but a directory block.
And while they take a struct ext4_dir_entry *dirent as an argument, it
had better be the first directory at the beginning of the direct
block, or things will go very wrong.

Rename the following functions so that things make more sense, and
remove a lot of confusing casts along the way:

ext4_dirent_csum_verify -> ext4_dirblock_csum_verify
ext4_dirent_csum_set -> ext4_dirblock_csum_set
ext4_dirent_csum -> ext4_dirblock_csum
ext4_handle_dirty_dirent_node -> ext4_handle_dirty_dirblock

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b886ee3e 25-Apr-2019 Gabriel Krisman Bertazi <krisman@collabora.co.uk>

ext4: Support case-insensitive file name lookups

This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string. This operation is called a
case-insensitive file name lookup.

The feature is configured as an inode attribute applied to directories
and inherited by its children. This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.

* dcache handling:

For a +F directory, Ext4 only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().

d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.

DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware. The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.

* Dealing with invalid sequences:

By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file. This means that case-insensitive
file name lookup will not work only for that file. An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding. When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.

* Normalization algorithm:

The UTF-8 algorithms used to compare strings in ext4 is implemented
lives in fs/unicode, and is based on a previous version developed by
SGI. It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method for EXT4 because:

- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like
compatibility decompositions.

Although:

- This implementation is not completely linguistic accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2b08b1f1 24-Dec-2018 Theodore Ts'o <tytso@mit.edu>

ext4: fix a potential fiemap/page fault deadlock w/ inline_data

The ext4_inline_data_fiemap() function calls fiemap_fill_next_extent()
while still holding the xattr semaphore. This is not necessary and it
triggers a circular lockdep warning. This is because
fiemap_fill_next_extent() could trigger a page fault when it writes
into page which triggers a page fault. If that page is mmaped from
the inline file in question, this could very well result in a
deadlock.

This problem can be reproduced using generic/519 with a file system
configuration which has the inline_data feature enabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 132d00be 03-Dec-2018 Maurizio Lombardi <mlombard@redhat.com>

ext4: missing unlock/put_page() in ext4_try_to_write_inline_data()

In case of error, ext4_try_to_write_inline_data() should unlock
and release the page it holds.

Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data")
Cc: stable@kernel.org # 3.8
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 625ef8a3 02-Oct-2018 Lukas Czerner <lczerner@redhat.com>

ext4: initialize retries variable in ext4_da_write_inline_data_begin()

Variable retries is not initialized in ext4_da_write_inline_data_begin()
which can lead to nondeterministic number of retries in case we hit
ENOSPC. Initialize retries to zero as we do everywhere else.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: bc0ca9df3b2a ("ext4: retry allocation when inline->extent conversion failed")
Cc: stable@kernel.org


# 4d982e25 27-Aug-2018 Theodore Ts'o <tytso@mit.edu>

ext4: avoid divide by zero fault when deleting corrupted inline directories

A specially crafted file system can trick empty_inline_dir() into
reading past the last valid entry in a inline directory, and then run
into the end of xattr marker. This will trigger a divide by zero
fault. Fix this by using the size of the inline directory instead of
dir->i_size.

Also clean up error reporting in __ext4_check_dir_entry so that the
message is clearer and more understandable --- and avoids the division
by zero trap if the size passed in is zero. (I'm not sure why we
coded it that way in the first place; printing offset % size is
actually more confusing and less useful.)

https://bugzilla.kernel.org/show_bug.cgi?id=200933

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org


# 362eca70 09-Jul-2018 Theodore Ts'o <tytso@mit.edu>

ext4: fix inline data updates with checksums enabled

The inline data code was updating the raw inode directly; this is
problematic since if metadata checksums are enabled,
ext4_mark_inode_dirty() must be called to update the inode's checksum.
In addition, the jbd2 layer requires that get_write_access() be called
before the metadata buffer is modified. Fix both of these problems.

https://bugzilla.kernel.org/show_bug.cgi?id=200443

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 8bc1379b 16-Jun-2018 Theodore Ts'o <tytso@mit.edu>

ext4: avoid running out of journal credits when appending to an inline file

Use a separate journal transaction if it turns out that we need to
convert an inline file to use an data block. Otherwise we could end
up failing due to not having journal credits.

This addresses CVE-2018-10883.

https://bugzilla.kernel.org/show_bug.cgi?id=200071

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 6e8ab72a 14-Jun-2018 Theodore Ts'o <tytso@mit.edu>

ext4: clear i_data in ext4_inode_info when removing inline data

When converting from an inode from storing the data in-line to a data
block, ext4_destroy_inline_data_nolock() was only clearing the on-disk
copy of the i_blocks[] array. It was not clearing copy of the
i_blocks[] in ext4_inode_info, in i_data[], which is the copy actually
used by ext4_map_blocks().

This didn't matter much if we are using extents, since the extents
header would be invalid and thus the extents could would re-initialize
the extents tree. But if we are using indirect blocks, the previous
contents of the i_blocks array will be treated as block numbers, with
potentially catastrophic results to the file system integrity and/or
user data.

This gets worse if the file system is using a 1k block size and
s_first_data is zero, but even without this, the file system can get
quite badly corrupted.

This addresses CVE-2018-10881.

https://bugzilla.kernel.org/show_bug.cgi?id=200015

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 19319b53 01-Jun-2018 Christoph Hellwig <hch@lst.de>

iomap: inline data should be an iomap type, not a flag

Inline data is fundamentally different from our normal mapped case in that
it doesn't even have a block address. So instead of having a flag for it
it should be an entirely separate iomap range type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 117166ef 22-May-2018 Theodore Ts'o <tytso@mit.edu>

ext4: do not allow external inodes for inline data

The inline data feature was implemented before we added support for
external inodes for xattrs. It makes no sense to support that
combination, but the problem is that there are a number of extended
attribute checks that are skipped if e_value_inum is non-zero.

Unfortunately, the inline data code is completely e_value_inum
unaware, and attempts to interpret the xattr fields as if it were an
inline xattr --- at which point, Hilarty Ensues.

This addresses CVE-2018-11412.

https://bugzilla.kernel.org/show_bug.cgi?id=199803

Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
Cc: stable@kernel.org


# c472c07b 01-Feb-2018 Goffredo Baroncelli <kreijack@inwind.it>

iversion: Rename make inode_cmp_iversion{+raw} to inode_eq_iversion{+raw}

The function inode_cmp_iversion{+raw} is counter-intuitive, because it
returns true when the counters are different and false when these are equal.

Rename it to inode_eq_iversion{+raw}, which will returns true when
the counters are equal and false otherwise.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Jeff Layton <jlayton@redhat.com>


# ee73f9a5 09-Jan-2018 Jeff Layton <jlayton@kernel.org>

ext4: convert to new i_version API

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>


# f5166768 17-Dec-2017 Theodore Ts'o <tytso@mit.edu>

ext4: fix up remaining files with SPDX cleanups

A number of ext4 source files were skipped due because their copyright
permission statements didn't match the expected text used by the
automated conversion utilities. I've added SPDX tags for the rest.

While looking at some of these files, I've noticed that we have quite
a bit of variation on the licenses that were used --- in particular
some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
and we have some files that have a LGPL-2.1 license (which was quite
surprising).

I've not attempted to do any license changes. Even if it is perfectly
legal to relicense to GPL 2.0-only for consistency's sake, that should
be done with ext4 developer community discussion.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 559db4c6 12-Oct-2017 Ross Zwisler <zwisler@kernel.org>

ext4: prevent data corruption with inline data + DAX

If an inode has inline data it is currently prevented from using DAX by a
check in ext4_set_inode_flags(). When the inode grows inline data via
ext4_create_inline_data() or removes its inline data via
ext4_destroy_inline_data_nolock(), the value of S_DAX can change.

Currently these changes are unsafe because we don't hold off page faults
and I/O, write back dirty radix tree entries and invalidate all mappings.
There are also issues with mm-level races when changing the value of S_DAX,
as well as issues with the VM_MIXEDMAP flag:

https://www.spinics.net/lists/linux-xfs/msg09859.html

The unsafe transition of S_DAX can reliably cause data corruption, as shown
by the following fstest:

https://patchwork.kernel.org/patch/9948381/

Fix this issue by preventing the DAX mount option from being used on
filesystems that were created to support inline data. Inline data is an
option given to mkfs.ext4.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org


# 7046ae35 01-Oct-2017 Andreas Gruenbacher <agruenba@redhat.com>

ext4: Add iomap support for inline data

Report inline data as a IOMAP_F_DATA_INLINE mapping. This allows to use
iomap_seek_hole and iomap_seek_data in ext4_llseek and makes switching
to iomap_fiemap in ext4_fiemap easier.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# e50e5129 21-Jun-2017 Andreas Dilger <andreas.dilger@intel.com>

ext4: xattr-in-inode support

Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.

If the size of an xattr value is larger than will fit in a single
external block, then the xattr value will be saved into the body
of an external xattr inode.

The also helps support a larger number of xattr, since only the headers
will be stored in the in-inode space or the single external block.

The inode is referenced from the xattr header via "e_value_inum",
which was formerly "e_value_block", but that field was never used.
The e_value_size still contains the xattr size so that listing
xattrs does not need to look up the inode if the data is not accessed.

struct ext4_xattr_entry {
__u8 e_name_len; /* length of name */
__u8 e_name_index; /* attribute name index */
__le16 e_value_offs; /* offset in disk block of value */
__le32 e_value_inum; /* inode in which value is stored */
__le32 e_value_size; /* size of attribute value */
__le32 e_hash; /* hash value of name and value */
char e_name[0]; /* attribute name */
};

The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
holds a back-reference to the owning inode in its i_mtime field,
allowing the ext4/e2fsck to verify the correct inode is accessed.

[ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ]

Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424
Signed-off-by: Kalpak Shah <kalpak.shah@sun.com>
Signed-off-by: James Simmons <uja.ornl@gmail.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>


# d6b97550 24-May-2017 Eric Biggers <ebiggers@google.com>

ext4: remove unused d_name argument from ext4_search_dir() et al.

Now that we are passing a struct ext4_filename, we do not need to pass
around the original struct qstr too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 1bc0af60 29-Apr-2017 Eric Biggers <ebiggers@google.com>

ext4: trim return value and 'dir' argument from ext4_insert_dentry()

In the initial implementation of ext4 encryption, the filename was
encrypted in ext4_insert_dentry(), which could fail and also required
access to the 'dir' inode. Since then ext4 filename encryption has been
changed to encrypt the filename earlier, so we can revert the additions
to ext4_insert_dentry().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b9cf625d 15-Mar-2017 Eric Biggers <ebiggers@google.com>

ext4: mark inode dirty after converting inline directory

If ext4_convert_inline_data() was called on a directory with inline
data, the filesystem was left in an inconsistent state (as considered by
e2fsck) because the file size was not increased to cover the new block.
This happened because the inode was not marked dirty after i_disksize
was updated. Fix this by marking the inode dirty at the end of
ext4_finish_convert_inline_dir().

This bug was probably not noticed before because most users mark the
inode dirty afterwards for other reasons. But if userspace executed
FS_IOC_SET_ENCRYPTION_POLICY with invalid parameters, as exercised by
'kvm-xfstests -c adv generic/396', then the inode was never marked dirty
after updating i_disksize.

Cc: stable@vger.kernel.org # 3.10+
Fixes: 3c47d54170b6a678875566b1b8d6dcf57904e49b
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 0db1ff22 04-Feb-2017 Theodore Ts'o <tytso@mit.edu>

ext4: add shutdown bit and check for it

Add a shutdown bit that will cause ext4 processing to fail immediately
with EIO.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# eb5efbcb 04-Feb-2017 Theodore Ts'o <tytso@mit.edu>

ext4: fix inline data error paths

The write_end() function must always unlock the page and drop its ref
count, even on an error.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 01daf945 22-Jan-2017 Theodore Ts'o <tytso@mit.edu>

ext4: propagate error values from ext4_inline_data_truncate()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b907f2d5 11-Jan-2017 Theodore Ts'o <tytso@mit.edu>

ext4: avoid calling ext4_mark_inode_dirty() under unneeded semaphores

There is no need to call ext4_mark_inode_dirty while holding xattr_sem
or i_data_sem, so where it's easy to avoid it, move it out from the
critical region.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c755e251 11-Jan-2017 Theodore Ts'o <tytso@mit.edu>

ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()

The xattr_sem deadlock problems fixed in commit 2e81a4eeedca: "ext4:
avoid deadlock when expanding inode size" didn't include the use of
xattr_sem in fs/ext4/inline.c. With the addition of project quota
which added a new extra inode field, this exposed deadlocks in the
inline_data code similar to the ones fixed by 2e81a4eeedca.

The deadlock can be reproduced via:

dmesg -n 7
mke2fs -t ext4 -O inline_data -Fq -I 256 /dev/vdc 32768
mount -t ext4 -o debug_want_extra_isize=24 /dev/vdc /vdc
mkdir /vdc/a
umount /vdc
mount -t ext4 /dev/vdc /vdc
echo foo > /vdc/a/foo

and looks like this:

[ 11.158815]
[ 11.160276] =============================================
[ 11.161960] [ INFO: possible recursive locking detected ]
[ 11.161960] 4.10.0-rc3-00015-g011b30a8a3cf #160 Tainted: G W
[ 11.161960] ---------------------------------------------
[ 11.161960] bash/2519 is trying to acquire lock:
[ 11.161960] (&ei->xattr_sem){++++..}, at: [<c1225a4b>] ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960]
[ 11.161960] but task is already holding lock:
[ 11.161960] (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
[ 11.161960]
[ 11.161960] other info that might help us debug this:
[ 11.161960] Possible unsafe locking scenario:
[ 11.161960]
[ 11.161960] CPU0
[ 11.161960] ----
[ 11.161960] lock(&ei->xattr_sem);
[ 11.161960] lock(&ei->xattr_sem);
[ 11.161960]
[ 11.161960] *** DEADLOCK ***
[ 11.161960]
[ 11.161960] May be due to missing lock nesting notation
[ 11.161960]
[ 11.161960] 4 locks held by bash/2519:
[ 11.161960] #0: (sb_writers#3){.+.+.+}, at: [<c11a2414>] mnt_want_write+0x1e/0x3e
[ 11.161960] #1: (&type->i_mutex_dir_key){++++++}, at: [<c119508b>] path_openat+0x338/0x67a
[ 11.161960] #2: (jbd2_handle){++++..}, at: [<c123314a>] start_this_handle+0x582/0x622
[ 11.161960] #3: (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
[ 11.161960]
[ 11.161960] stack backtrace:
[ 11.161960] CPU: 0 PID: 2519 Comm: bash Tainted: G W 4.10.0-rc3-00015-g011b30a8a3cf #160
[ 11.161960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[ 11.161960] Call Trace:
[ 11.161960] dump_stack+0x72/0xa3
[ 11.161960] __lock_acquire+0xb7c/0xcb9
[ 11.161960] ? kvm_clock_read+0x1f/0x29
[ 11.161960] ? __lock_is_held+0x36/0x66
[ 11.161960] ? __lock_is_held+0x36/0x66
[ 11.161960] lock_acquire+0x106/0x18a
[ 11.161960] ? ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960] down_write+0x39/0x72
[ 11.161960] ? ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960] ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960] ? _raw_read_unlock+0x22/0x2c
[ 11.161960] ? jbd2_journal_extend+0x1e2/0x262
[ 11.161960] ? __ext4_journal_get_write_access+0x3d/0x60
[ 11.161960] ext4_mark_inode_dirty+0x17d/0x26d
[ 11.161960] ? ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
[ 11.161960] ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
[ 11.161960] ext4_try_add_inline_entry+0x69/0x152
[ 11.161960] ext4_add_entry+0xa3/0x848
[ 11.161960] ? __brelse+0x14/0x2f
[ 11.161960] ? _raw_spin_unlock_irqrestore+0x44/0x4f
[ 11.161960] ext4_add_nondir+0x17/0x5b
[ 11.161960] ext4_create+0xcf/0x133
[ 11.161960] ? ext4_mknod+0x12f/0x12f
[ 11.161960] lookup_open+0x39e/0x3fb
[ 11.161960] ? __wake_up+0x1a/0x40
[ 11.161960] ? lock_acquire+0x11e/0x18a
[ 11.161960] path_openat+0x35c/0x67a
[ 11.161960] ? sched_clock_cpu+0xd7/0xf2
[ 11.161960] do_filp_open+0x36/0x7c
[ 11.161960] ? _raw_spin_unlock+0x22/0x2c
[ 11.161960] ? __alloc_fd+0x169/0x173
[ 11.161960] do_sys_open+0x59/0xcc
[ 11.161960] SyS_open+0x1d/0x1f
[ 11.161960] do_int80_syscall_32+0x4f/0x61
[ 11.161960] entry_INT80_32+0x2f/0x2f
[ 11.161960] EIP: 0xb76ad469
[ 11.161960] EFLAGS: 00000286 CPU: 0
[ 11.161960] EAX: ffffffda EBX: 08168ac8 ECX: 00008241 EDX: 000001b6
[ 11.161960] ESI: b75e46bc EDI: b7755000 EBP: bfbdb108 ESP: bfbdafc0
[ 11.161960] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b

Cc: stable@vger.kernel.org # 3.10 (requires 2e81a4eeedca as a prereq)
Reported-by: George Spelvin <linux@sciencehorizons.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 578620f4 10-Dec-2016 Dan Carpenter <dan.carpenter@oracle.com>

ext4: return -ENOMEM instead of success

We should set the error code if kzalloc() fails.

Fixes: 67cf5b09a46f ("ext4: add the basic function for inline data support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# a3caa24b 20-Nov-2016 Jan Kara <jack@suse.cz>

ext4: only set S_DAX if DAX is really supported

Currently we have S_DAX set inode->i_flags for a regular file whenever
ext4 is mounted with dax mount option. However in some cases we cannot
really do DAX - e.g. when inode is marked to use data journalling, when
inode data is being encrypted, or when inode is stored inline. Make sure
S_DAX flag is appropriately set/cleared in these cases.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# eeca7ea1 14-Nov-2016 Deepa Dinamani <deepa.kernel@gmail.com>

ext4: use current_time() for inode timestamps

CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.

current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().

Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>


# a7550b30 10-Jul-2016 Jaegeuk Kim <jaegeuk@kernel.org>

ext4 crypto: migrate into vfs's crypto engine

This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8d2ae1cb 26-Apr-2016 Jakub Wilk <jwilk@jwilk.net>

ext4: remove trailing \n from ext4_warning/ext4_error calls

Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a8ed9b86 09-Mar-2016 Geliang Tang <geliangtang@163.com>

ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()

BUFFER_TRACE info "call ext4_handle_dirty_metadata" doesn't match the
code, so drop it.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 705965bd 08-Mar-2016 Jan Kara <jack@suse.cz>

ext4: rename and split get blocks functions

Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
describe what it does. Also split out get blocks functions for direct
IO. Later we move functionality from _ext4_get_blocks() there. There's no
functional change in this patch.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 56a04915 08-Jan-2016 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: simplify interfaces to directory entry insert functions

A number of functions include ext4_add_dx_entry, make_indexed_dir,
etc. are being passed a dentry even though the only thing they use is
the containing parent. We can shrink the code size slightly by making
this replacement. This will also be useful in cases where we don't
have a dentry as the argument to the directory entry insert functions.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e2b911c5 17-Oct-2015 Darrick J. Wong <darrick.wong@oracle.com>

ext4: clean up feature test macros with predicate functions

Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros. Furthermore, clean out the
places where we open-coded feature tests.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 5b643f9c 18-May-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: optimize filename encryption

Encrypt the filename as soon it is passed in by the user. This avoids
our needing to encrypt the filename 2 or 3 times while in the process
of creating a filename.

Similarly, when looking up a directory entry, encrypt the filename
early, or if the encryption key is not available, base-64 decode the
file syystem so that the hash value and the last 16 bytes of the
encrypted filename is available in the new struct ext4_filename data
structure.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2b0143b5 17-Mar-2015 David Howells <dhowells@redhat.com>

VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4bdfc873 11-Apr-2015 Michael Halcrow <mhalcrow@google.com>

ext4 crypto: insert encrypted filenames into a leaf directory block

Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2f61830a 11-Apr-2015 Theodore Ts'o <tytso@mit.edu>

ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames

For encrypted directories, we need to pass in a separate parameter for
the decrypted filename, since the directory entry contains the
encrypted filename.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 80cfb71e 02-Apr-2015 Rasmus Villemoes <linux@rasmusvillemoes.dk>

ext4: fix transposition typo in format string

According to C99, %*.s means the same as %*.0s, in other words, print as
many spaces as the field width argument says and effectively ignore the
string argument. That is certainly not what was meant here. The kernel's
printf implementation, however, treats it as if the . was not there,
i.e. as %*s. I don't know if de->name is nul-terminated or not, but in
any case I'm guessing the intention was to use de->name_len as precision
instead of field width.

[ Note: this is debugging code which is commented out, so this is not
security issue; a developer would have to explicitly enable
INLINE_DIR_DEBUG before this would be an issue. ]

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 50db71ab 05-Dec-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: ext4_da_convert_inline_data_to_extent drop locked page after error

Testcase:
xfstests generic/270
MKFS_OPTIONS="-q -I 256 -O inline_data,64bit"

Call Trace:
[<ffffffff81144c76>] lock_page+0x35/0x39 -------> DEADLOCK
[<ffffffff81145260>] pagecache_get_page+0x65/0x15a
[<ffffffff811507fc>] truncate_inode_pages_range+0x1db/0x45c
[<ffffffff8120ea63>] ? ext4_da_get_block_prep+0x439/0x4b6
[<ffffffff811b29b7>] ? __block_write_begin+0x284/0x29c
[<ffffffff8120e62a>] ? ext4_change_inode_journal_flag+0x16b/0x16b
[<ffffffff81150af0>] truncate_inode_pages+0x12/0x14
[<ffffffff81247cb4>] ext4_truncate_failed_write+0x19/0x25
[<ffffffff812488cf>] ext4_da_write_inline_data_begin+0x196/0x31c
[<ffffffff81210dad>] ext4_da_write_begin+0x189/0x302
[<ffffffff810c07ac>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff810ddd13>] ? read_seqcount_begin.clone.1+0x9f/0xcc
[<ffffffff8114309d>] generic_perform_write+0xc7/0x1c6
[<ffffffff810c040e>] ? mark_held_locks+0x59/0x77
[<ffffffff811445d1>] __generic_file_write_iter+0x17f/0x1c5
[<ffffffff8120726b>] ext4_file_write_iter+0x2a5/0x354
[<ffffffff81185656>] ? file_start_write+0x2a/0x2c
[<ffffffff8107bcdb>] ? bad_area_nosemaphore+0x13/0x15
[<ffffffff811858ce>] new_sync_write+0x8a/0xb2
[<ffffffff81186e7b>] vfs_write+0xb5/0x14d
[<ffffffff81186ffb>] SyS_write+0x5c/0x8c
[<ffffffff816f2529>] system_call_fastpath+0x12/0x17

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d952d69e 02-Dec-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: ext4_inline_data_fiemap should respect callers argument

Currently ext4_inline_data_fiemap ignores requested arguments (start
and len) which may lead endless loop if start != 0. Also fix incorrect
extent length determination.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5cc28a9e 02-Dec-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: prevent fsreentrance deadlock for inline_data

ext4_da_convert_inline_data_to_extent() invokes
grab_cache_page_write_begin(). grab_cache_page_write_begin performs
memory allocation, so fs-reentrance should be prohibited because we
are inside journal transaction.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9aa5d32b 13-Oct-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: Replace open coded mdata csum feature to helper function

Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized csum machinery.

#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END

# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)

https://bugzilla.kernel.org/show_bug.cgi?id=82201

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 684de574 11-Sep-2014 Darrick J. Wong <darrick.wong@oracle.com>

ext4: don't keep using page if inline conversion fails

If inline->extent conversion fails (most probably due to ENOSPC) and
we release the temporary page that we allocated to transfer the file
contents, don't keep using the page pointer after releasing the page.
This occasionally leads to complaints about evicting locked pages or
hangs when blocksize > pagesize, because it's possible for the page to
get reallocated elsewhere in the meantime.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tao Ma <tm@tao.ma>


# 40b163f1 28-Jul-2014 Darrick J. Wong <darrick.wong@oracle.com>

ext4: check inline directory before converting

Before converting an inline directory to a regular directory, check
the directory entries to make sure they're not obviously broken.
This helps us to avoid a BUG_ON if one of the dirents is trashed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>


# 83447ccb 15-Jul-2014 Zheng Liu <wenqing.lz@taobao.com>

ext4: make ext4_has_inline_data() as a inline function

Now ext4_has_inline_data() is used in wide spread codepaths. So we need
to make it as a inline function to avoid burning some CPU cycles.

Change in text size:

text data bss dec hex filename
before: 326110 19258 5528 350896 55ab0 fs/ext4/ext4.o
after: 326227 19258 5528 351013 55b25 fs/ext4/ext4.o

I use the following script to measure the CPU usage.

#!/bin/bash

shm_base='/dev/shm'
img=${shm_base}/ext4-img
mnt=/mnt/loop

e2fsprgs_base=$HOME/e2fsprogs
mkfs=${e2fsprgs_base}/misc/mke2fs
fsck=${e2fsprgs_base}/e2fsck/e2fsck

sudo umount $mnt
dd if=/dev/zero of=$img bs=4k count=3145728
${mkfs} -t ext4 -O inline_data -F $img
sudo mount -t ext4 -o loop $img $mnt

# start testing...
testdir="${mnt}/testdir"
mkdir $testdir
cd $testdir

echo "start testing..."
for ((cnt=0;cnt<100;cnt++)); do

for ((i=0;i<5;i++)); do
for ((j=0;j<5;j++)); do
for ((k=0;k<5;k++)); do
for ((l=0;l<5;l++)); do
mkdir -p $i/$j/$k/$l
echo "$i-$j-$k-$l" > $i/$j/$k/$l/testfile
done
done
done
done

ls -R $testdir > /dev/null
rm -rf $testdir/*

done

The result of `perf top -G -U` is as below.

vanilla:
13.92% [ext4] [k] ext4_do_update_inode
9.36% [ext4] [k] __ext4_get_inode_loc
4.07% [ext4] [k] ftrace_define_fields_ext4_writepages
3.83% [ext4] [k] __ext4_handle_dirty_metadata
3.42% [ext4] [k] ext4_get_inode_flags
2.71% [ext4] [k] ext4_mark_iloc_dirty
2.46% [ext4] [k] ftrace_define_fields_ext4_direct_IO_enter
2.26% [ext4] [k] ext4_get_inode_loc
2.22% [ext4] [k] ext4_has_inline_data
[...]

After applied the patch, we don't see ext4_has_inline_data() because it
has been inlined and perf couldn't sample it. Although it doesn't mean
that the CPU cycles can be saved but at least the overhead of function
calls can be eliminated. So IMHO we'd better inline this function.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5d601255 12-May-2014 liang xie <xieliang007@gmail.com>

ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access

Make them more consistently

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c197855e 12-May-2014 Stephen Hemminger <stephen@networkplumber.org>

ext4: make local functions static

I have been running make namespacecheck to look for unneeded globals, and
found these in ext4.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# d7092ae2 11-Jan-2014 jon ernst <jonernst07@gmail.com>

ext4: delete "set but not used" variables

Signed-off-by: Jon Ernst <jonernst07@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>


# 09c455aa 06-Jan-2014 Theodore Ts'o <tytso@mit.edu>

ext4: avoid clearing beyond i_blocks when truncating an inline data file

A missing cast means that when we are truncating a file which is less
than 60 bytes, we don't clear the correct area of memory, and in fact
we can end up truncating the next inode in the inode table, or worse
yet, some other kernel data structure.

Addresses-Coverity-Id: #751987

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 52e44777 06-Jan-2014 Jan Kara <jack@suse.cz>

ext4: standardize error handling in ext4_da_write_inline_data_begin()

The function has a bit non-standard (for ext4) error recovery in that it
used a mix of 'out' labels and testing for 'handle' being NULL. There
isn't a good reason for that in the function so clean it up a bit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# bc0ca9df 06-Jan-2014 Jan Kara <jack@suse.cz>

ext4: retry allocation when inline->extent conversion failed

Similarly as other ->write_begin functions in ext4, also
ext4_da_write_inline_data_begin() should retry allocation if the
conversion failed because of ENOSPC. This avoids returning ENOSPC
prematurely because of uncommitted block deletions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5ba052fe 30-Oct-2013 Azat Khuzhin <a3at.mail@gmail.com>

ext4: drop set but otherwise unused variable from ext4_add_dirent_to_inline()

Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 48ffdab1 30-Oct-2013 BoxiLiu <lewis.liulei@huawei.com>

ext4: change ext4_read_inline_dir() to return 0 on success

In ext4_read_inline_dir(), if there is inline data, the successful
return value is the return value of ext4_read_inline_data(). Howewer,
this is used by ext4_readdir(), and while it seems harmless to return
a positive value on success, it's inconsistent, since historically
we've always return 0 on success.

Signed-off-by: BoxiLiu <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Tao Ma <boyu.mt@taobao.com>


# c4932dbe 01-Jul-2013 boxi liu <boxi10liu@gmail.com>

ext4: improve free space calculation for inline_data

In ext4 feature inline_data,it use the xattr's space to store the
inline data in inode.When we calculate the inline data as the xattr,we
add the pad.But in get_max_inline_xattr_value_size() function we count
the free space without pad.It cause some contents are moved to a block
even if it can be
stored in the inode.

Signed-off-by: liulei <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>


# 725bebb2 17-May-2013 Al Viro <viro@zeniv.linux.org.uk>

[readdir] convert ext4

and trim the living hell out bogosities in inline dir case

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# eaf37937 31-May-2013 Jan Kara <jack@suse.cz>

ext4: fix data offset overflow on 32-bit archs in ext4_inline_data_fiemap()

On 32-bit archs when sector_t is defined as 32-bit the logic computing
data offset in ext4_inline_data_fiemap(). Fix that by properly typing
the shifted value.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c4d8b023 19-Apr-2013 Tao Ma <boyu.mt@taobao.com>

ext4: fix readdir error in case inline_data+^dir_index.

Zach reported a problem that if inline data is enabled, we don't
tell the difference between the offset of '.' and '..'. And a
getdents will fail if the user only want to get '.'. And what's
worse, we may meet with duplicate dir entries as the offset
for inline dir and non-inline one is quite different.

This patch just try to resolve this problem if dir_index
is disabled. In this case, f_pos is the real offset with
the dir block, so for inline dir, we just pretend as if
we are a dir block and returns the offset like a norml
dir block does.

Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 8af0f082 19-Apr-2013 Tao Ma <boyu.mt@taobao.com>

ext4: fix readdir error in the case of inline_data+dir_index

Zach reported a problem that if inline data is enabled, we don't
tell the difference between the offset of '.' and '..'. And a
getdents will fail if the user only want to get '.' and what's worse,
if there is a conversion happens when the user calls getdents
many times, he/she may get the same entry twice.

In theory, a dir block would also fail if it is converted to a
hashed-index based dir since f_pos will become a hash value, not the
real one, but it doesn't happen. And a deep investigation shows that
we uses a hash based solution even for a normal dir if the dir_index
feature is enabled.

So this patch just adds a new htree_inlinedir_to_tree for inline dir,
and if we find that the hash index is supported, we will do like what
we do for a dir block.

Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9924a92a 08-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: pass context information to jbd2__journal_start()

So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

T=/sys/kernel/debug/tracing
EVENT=$T/events/jbd2/jbd2_handle_stats
echo "interval > 5" > $EVENT/filter
echo 1 > $EVENT/enable

./run-my-fs-benchmark

cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms. Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation. Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 860d21e2 12-Jan-2013 Theodore Ts'o <tytso@mit.edu>

ext4: return ENOMEM if sb_getblk() fails

The only reason for sb_getblk() failing is if it can't allocate the
buffer_head. So ENOMEM is more appropriate than EIO. In addition,
make sure that the file system is marked as being inconsistent if
sb_getblk() fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org


# bd9926e8 11-Dec-2012 Theodore Ts'o <tytso@mit.edu>

ext4: zero out inline data using memset() instead of empty_zero_page

Not all architectures (in particular, sparc64) have empty_zero_page.
So instead of copying from empty_zero_page, use memset to clear the
inline data by signalling to ext4_xattr_set_entry() via a magic
pointer value, EXT4_ZERO_ATTR_VALUE, which is defined by casting -1 to
a pointer.

This fixes a build failure on sparc64, and the memset() should be more
efficient than using memcpy() anyway.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0c8d414f 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let fallocate handle inline data correctly

If we are punching hole in a file, we will return ENOTSUPP.
As for the fallocation of some extents, we will convert the
inline data to a normal extent based file first.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# aef1c851 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_truncate handle inline data correctly

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0d812f77 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: evict inline data out if we need to strore xattr in inode

Now we that store data in the inode, in case we need to store some
xattrs and inode doesn't have enough space, Andreas suggested that we
should keep the xattr(metadata) in and data should be pushed out. So
this patch does the work.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 94191985 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let fiemap work with inline data

fiemap is used to find the disk layout of a file, as for inline data,
let us just pretend like a file with just one extent.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 32f7f22c 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_rename handle inline dir

In case we rename a directory, ext4_rename has to read the dir block
and change its dotdot's information. The old ext4_rename encapsulated
the dir_block read into itself. So this patch adds a new function
ext4_get_first_dir_block() which gets the dir buffer information so
the ext4_rename can handle it properly. As it will also change the
parent inode number, we return the parent_de so that ext4_rename() can
handle it more easily.

ext4_find_entry is also changed so that the caller(rename) can tell
whether the found entry is an inlined one or not and journaling the
corresponding buffer head.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 61f86638 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let empty_dir handle inline dir

empty_dir is used when deleting a dir. So it should handle inline dir
properly.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9f40fe54 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_delete_entry() handle inline data

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e8e948e7 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_find_entry handle inline data

Create a new function ext4_find_inline_entry() to handle the case of
inline data.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 65d165d9 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let ext4_readdir handle inline data

For "." and "..", we just call filldir by ourselves
instead of iterating the real dir entry.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3c47d541 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: let add_dir_entry handle inline data properly

This patch let add_dir_entry handle the inline data case. So the
dir is initialized as inline dir first and then we can try to add
some files to it, when the inline space can't hold all the entries,
a dir block will be created and the dir entry will be moved to it.

Also for an inlined dir, "." and ".." are removed and we only use
4 bytes to store the parent inode number. These 2 entries will be
added when we convert an inline dir to a block-based one.

[ Folded in patch from Dan Carpenter to remove an unused variable. ]

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9c3569b5 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add delalloc support for inline data

For delayed allocation mode, we write to inline data if the file
is small enough. And in case of we write to some offset larger
than the inline size, the 1st page is dirtied, so that
ext4_da_writepages can handle the conversion. When the 1st page
is initialized with blocks, the inline part is removed.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3fdcfb66 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add journalled write support for inline data

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f19d5870 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add normal write support for inline data

For a normal write case (not journalled write, not delayed
allocation), we write to the inline if the file is small and convert
it to an extent based file when the write is larger than the max
inline size.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 46c7f254 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add read support for inline data

Let readpage and readpages handle the case when we want to read an
inlined file.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 67cf5b09 10-Dec-2012 Tao Ma <boyu.mt@taobao.com>

ext4: add the basic function for inline data support

Implement inline data with xattr.

Now we use "system.data" to store xattr, and the xattr will
be extended if the i_size is increased while we don't release
the space during truncate.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>