History log of /linux-master/fs/ext4/mballoc.c
Revision Date Author Comments
# 0efcd739 26-Feb-2024 Wenchao Hao <haowenchao2@huawei.com>

ext4: remove unused parameter biop in ext4_issue_discard()

All callers of ext4_issue_discard() set biop to NULL since
commit 55cdd0af2bc5 ("ext4: get discard out of jbd2 commit kthread
contex"), so it's unnecessary to keep this parameter any more.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240226081731.3224470-1-haowenchao2@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# fa606293 13-Feb-2024 Jan Kara <jack@suse.cz>

ext4: don't report EOPNOTSUPP errors from discard

When ext4 is mounted without a journal, with the discard mount option, and
on a device not supporting trim, we print an error for each and every freed
extent. This is not only useless but actively harmful. Instead, ignore the
EOPNOTSUPP error. Trim is only advisory anyway, and when the filesystem has
a journal we silently ignore trim errors as well.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240213101601.17463-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4fbf8bc7 01-Feb-2024 Baokun Li <libaokun1@huawei.com>

ext4: correct best extent lstart adjustment logic

When yangerkun reviewed commit 93cdf49f6eca ("ext4: Fix best extent lstart
adjustment logic in ext4_mb_new_inode_pa()"), it was found that the best
extent did not completely cover the original request after adjusting the
best extent lstart in ext4_mb_new_inode_pa(), as follows:

original request: 2/10(8)
normalized request: 0/64(64)
best extent: 0/9(9)

When we check whether the best extent can be kept at the start of the goal,
ac_o_ex.fe_logical (2) is less than the adjusted best extent's logical end
(9), so we think the adjustment is done. But obviously 0/9(9) doesn't cover
2/10(8), so we should instead check whether the original request's logical
end is less than or equal to the adjusted best extent's logical end.
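
A userspace illustration of the two comparisons, using the numbers from the
example above (this is only a sketch of the check being described, not the
kernel code itself):

#include <stdio.h>

int main(void)
{
	unsigned int orig_start = 2, orig_len = 8;  /* original request 2/10(8) */
	unsigned int best_start = 0, best_len = 9;  /* adjusted best extent 0/9(9) */

	unsigned int orig_end = orig_start + orig_len;  /* 10 */
	unsigned int best_end = best_start + best_len;  /*  9 */

	/* old check: original logical start vs. adjusted best extent end */
	printf("old check says covered: %d\n", orig_start < best_end);  /* prints 1 */
	/* new check: original logical end vs. adjusted best extent end  */
	printf("new check says covered: %d\n", orig_end <= best_end);   /* prints 0 */
	return 0;
}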

In addition, add a comment stating when adjusted best_ex will not cover
the original request, and remove the duplicate assertion because adjusting
lstart makes no change to b_ex.fe_len.

Link: https://lore.kernel.org/r/3630fa7f-b432-7afd-5f79-781bc3b2c5ea@huawei.com
Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
Cc: <stable@kernel.org>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20240201141845.1879253-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 68ee261f 18-Jan-2024 Zhang Yi <yi.zhang@huawei.com>

ext4: add a hint for block bitmap corrupt state in mb_groups

If a group is marked as having a corrupted block bitmap, its free blocks
cannot be used and its free count is also deducted from the global
sbi->s_freeclusters_counter. Users might be confused about the missing free
space because we can't query which block groups are corrupted except through
unreliable error messages in syslog. So add a hint showing block groups with
corrupted block bitmaps in mb_groups.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240119061154.1525781-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4b55d343 17-Jan-2024 yangerkun <yangerkun@huawei.com>

ext4: improve error msg for ext4_mb_seq_groups_show

When running cat on mb_groups for a mounted ext4 image that has some
corrupted group, the string returned to userspace was just "I/O error",
which is quite confusing. Improve it with ext4_decode_error.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240118042557.380058-2-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 25044880 17-Jan-2024 yangerkun <yangerkun@huawei.com>

ext4: remove unused buddy_loaded in ext4_mb_seq_groups_show

We can simply call ext4_mb_unload_buddy first and then copy the information
from ext4_group_info. So remove this unused variable.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240118042557.380058-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f0e54b60 05-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove 'needed' in trace_ext4_discard_preallocations

The 'needed' argument to trace_ext4_discard_preallocations is always 0,
which is meaningless. Just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2ffd2a6a 05-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations

The "needed" controls the number of ext4_prealloc_space to discard in
ext4_discard_preallocations. Function ext4_discard_preallocations is
supposed to discard all non-used preallocated blocks when "needed"
is 0 and now ext4_discard_preallocations is always called with "needed"
= 0. Remove unnecessary parameter "needed" and remove all non-used
preallocated spaces in ext4_discard_preallocations to simplify the
code.

Note: if the count of unused preallocated spaces could exceed UINT_MAX,
there was a memory leak because some unused preallocated spaces were left
behind; this commit fixes that. Otherwise, there is no behavior change.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 20427949 05-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unused return value of ext4_mb_release_group_pa

Remove unused return value of ext4_mb_release_group_pa.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 820c2808 05-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unused return value of ext4_mb_release_inode_pa

Remove unused return value of ext4_mb_release_inode_pa.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 90817717 05-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unused return value of ext4_mb_release

Remove unused return value of ext4_mb_release.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 11fd1a5d 05-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unneeded return value of ext4_mb_release_context

Function ext4_mb_release_context always returns 0 and the return value is
never used. Just remove the unneeded return value of ext4_mb_release_context.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 438a35e7 05-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*()

Remove unused parameter ngroup in ext4_mb_choose_next_group_*().

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20240105092102.496631-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 133de5a0 05-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unused return value of __mb_check_buddy

Remove unused return value of __mb_check_buddy.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c5f3a382 04-Jan-2024 Baokun Li <libaokun1@huawei.com>

ext4: mark the group block bitmap as corrupted before reporting an error

Otherwise unlocking the group in ext4_grp_locked_error may allow other
processes to modify the core block bitmap that is known to be corrupt.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 83269837 04-Jan-2024 Baokun Li <libaokun1@huawei.com>

ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal()

Place the logic for checking whether the group's block bitmap is corrupt
under the protection of the group lock to avoid allocating blocks from a
group with a corrupted block bitmap.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4530b366 04-Jan-2024 Baokun Li <libaokun1@huawei.com>

ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found()

Determine whether the group block bitmap is corrupted before using ac_b_ex
in ext4_mb_try_best_found() to avoid allocating blocks from a group with a
corrupted block bitmap in the following concurrency scenario and making the
situation worse.

ext4_mb_regular_allocator
  ext4_lock_group(sb, group)
  ext4_mb_good_group
    // check if the group block bitmap is corrupted
  ext4_mb_complex_scan_group
    // scanning the group gets ac_b_ex but doesn't use it yet
  ext4_unlock_group(sb, group)
                        ext4_mark_group_bitmap_corrupted(group)
                        // The block bitmap was corrupted during
                        // the group unlock gap.
  ext4_mb_try_best_found
    ext4_lock_group(ac->ac_sb, group)
    ext4_mb_use_best_found
      mb_mark_used
        // Allocating blocks in a group with a corrupted block bitmap

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 993bf0f4 04-Jan-2024 Baokun Li <libaokun1@huawei.com>

ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt

Check whether bb_fragments is 0 instead of checking bb_free, to eliminate
the risk of dividing by zero when the block bitmap is corrupted.
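
A minimal sketch of the guard being described, assuming the ext4_group_info
field names used elsewhere in mballoc (illustrative fragment only, not the
exact kernel diff):

	if (grp->bb_fragments == 0)	/* a corrupted bitmap can leave 0 fragments */
		return;			/* nothing meaningful to average */

	new_order = mb_avg_fragment_size_order(sb,
					grp->bb_free / grp->bb_fragments);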

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2331fd4a 04-Jan-2024 Baokun Li <libaokun1@huawei.com>

ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks()

After updating bb_free in mb_free_blocks, it is possible to return without
updating bb_fragments because the block being freed is found to have
already been freed, which leads to inconsistency between bb_free and
bb_fragments.

Since the group may be unlocked in ext4_grp_locked_error(), this can lead
to problems such as dividing by zero when calculating the average fragment
length. Hence move the update of bb_free to after the block double-free
check; this guarantees that the corresponding statistics are updated only
after the core block bitmap is modified.

Fixes: eabe0444df90 ("ext4: speed-up releasing blocks on commit")
CC: <stable@vger.kernel.org> # 3.10
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c9b528c3 04-Jan-2024 Baokun Li <libaokun1@huawei.com>

ext4: regenerate buddy after block freeing failed if under fc replay

This mostly reverts commit 6bd97bf273bd ("ext4: remove redundant
mb_regenerate_buddy()") and reintroduces mb_regenerate_buddy(). Based on
code in mb_free_blocks(), fast commit replay can end up marking as free
blocks that are already marked as such. This causes corruption of the
buddy bitmap so we need to regenerate it in that case.

Reported-by: Jan Kara <jack@suse.cz>
Fixes: 6bd97bf273bd ("ext4: remove redundant mb_regenerate_buddy()")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 17220215 04-Jan-2024 Baokun Li <libaokun1@huawei.com>

ext4: do not trim the group with corrupted block bitmap

Otherwise, operating on a corrupted block bitmap can lead to all sorts of
unknown problems.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240104142040.2835097-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 68da4c44 15-Dec-2023 Ye Bin <yebin10@huawei.com>

ext4: fix inconsistent between segment fstrim and full fstrim

Suppose we issue two FITRIM ioctls for ranges [0,15] and [16,31] with
minimum length of trimmed range set to 8 blocks. If we have, say, a range of
blocks 10-22 free, this range will not be trimmed because it straddles the
boundary of the two FITRIM ranges and neither part is big enough. This is a
bit surprising to some users that call FITRIM on smaller ranges of blocks
to limit impact on the system. Also XFS trims all free space extents that
overlap with the specified range so we are inconsistent among filesystems.
Let's change ext4_try_to_trim_range() to consider for trimming the whole
free space extent that straddles the end of specified range, not just the
part of it within the range.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231216010919.1995851-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1f6bc02f 15-Dec-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: fallback to complex scan if aligned scan doesn't work

Currently in case the goal length is a multiple of stripe size we use
ext4_mb_scan_aligned() to find the stripe size aligned physical blocks.
In case we are not able to find any, we again go back to calling
ext4_mb_choose_next_group() to search for a different suitable block
group. However, since the linear search always begins from the start,
most of the time we end up with the same BG and the cycle continues.

With large filesystems, the CPU can be stuck in this loop for hours,
which can slow down the whole system. Hence, until we figure out a
better way to continue the search (rather than starting from the beginning)
in ext4_mb_choose_next_group(), let's just fall back to
ext4_mb_complex_scan_group() in case the aligned scan fails, as it is much
more likely to find the needed blocks.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/ee033f6dfa0a7f2934437008a909c3788233950f.1702455010.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7c784d62 12-Dec-2023 Suraj Jitindar Singh <surajjs@amazon.com>

ext4: allow for the last group to be marked as trimmed

The ext4 filesystem tracks the trim status of blocks at the group
level. When an entire group has been trimmed, it is marked as such and
subsequent trim invocations with the same minimum trim size will not be
attempted on that group unless it is marked as trimmable again, such as
when a block is freed.

Currently the last group can't be marked as trimmed due to incorrect
logic in ext4_last_grp_cluster(). ext4_last_grp_cluster() is supposed
to return the zero based index of the last cluster in a group. This is
then used by ext4_try_to_trim_range() to determine if the trim
operation spans the entire group and as such if the trim status of the
group should be recorded.

ext4_last_grp_cluster() takes a 0 based group index, thus the valid
values for grp are 0..(ext4_get_groups_count - 1). Any group index
less than (ext4_get_groups_count - 1) is not the last group and must
have EXT4_CLUSTERS_PER_GROUP(sb) clusters. For the last group we need
to calculate the number of clusters based on the number of blocks in
the group. Finally subtract 1 from the number of clusters as zero
based indexing is expected. Rearrange the function slightly to make
it clear what we are calculating and returning.

Reproducer:
// Create file system where the last group has fewer blocks than
// blocks per group
$ mkfs.ext4 -b 4096 -g 8192 /dev/nvme0n1 8191
$ mount /dev/nvme0n1 /mnt

Before Patch:
$ fstrim -v /mnt
/mnt: 25.9 MiB (27156480 bytes) trimmed
// Group not marked as trimmed so second invocation still discards blocks
$ fstrim -v /mnt
/mnt: 25.9 MiB (27156480 bytes) trimmed

After Patch:
fstrim -v /mnt
/mnt: 25.9 MiB (27156480 bytes) trimmed
// Group marked as trimmed so second invocation DOESN'T discard any blocks
fstrim -v /mnt
/mnt: 0 B (0 bytes) trimmed

Fixes: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()")
Cc: <stable@vger.kernel.org> # 4.19+
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213051635.37731-1-surajjs@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2bf5eb2a 13-Nov-2023 Gou Hao <gouhao@uniontech.com>

ext4: improving calculation of 'fe_{len|start}' in mb_find_extent()

After the first execution of mb_find_order_for_block():

'fe_start' is the value of 'block' passed into mb_find_extent().

'fe_len' is the order-chunk length minus the remainder of 'block' modulo the
order-chunk, i.e. (1 << order) - (block % (1 << order)).

And 'next' no longer requires initialization after the above modifications.

Signed-off-by: Gou Hao <gouhao@uniontech.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231113082617.11258-1-gouhao@uniontech.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f2fec3e9 23-Oct-2023 Gou Hao <gouhao@uniontech.com>

ext4: delete redundant calculations in ext4_mb_get_buddy_page_lock()

'blocks_per_page' is always 1 once we are past the 'if (blocks_per_page >= 2)'
branch, so 'pnum' and 'block' are equal in this case.

Signed-off-by: Gou Hao <gouhao@uniontech.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231024035215.29474-1-gouhao@uniontech.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2dcf5fde 26-Nov-2023 Baokun Li <libaokun1@huawei.com>

ext4: prevent the normalized size from exceeding EXT_MAX_BLOCKS

For files with logical blocks close to EXT_MAX_BLOCKS, the file size
predicted in ext4_mb_normalize_request() may exceed EXT_MAX_BLOCKS.
This can cause some blocks to be preallocated that will not be used.
And after [Fixes], the following issue may be triggered:

=========================================================
kernel BUG at fs/ext4/mballoc.c:4653!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 1 PID: 2357 Comm: xfs_io 6.7.0-rc2-00195-g0f5cc96c367f
Hardware name: linux,dummy-virt (DT)
pc : ext4_mb_use_inode_pa+0x148/0x208
lr : ext4_mb_use_inode_pa+0x98/0x208
Call trace:
ext4_mb_use_inode_pa+0x148/0x208
ext4_mb_new_inode_pa+0x240/0x4a8
ext4_mb_use_best_found+0x1d4/0x208
ext4_mb_try_best_found+0xc8/0x110
ext4_mb_regular_allocator+0x11c/0xf48
ext4_mb_new_blocks+0x790/0xaa8
ext4_ext_map_blocks+0x7cc/0xd20
ext4_map_blocks+0x170/0x600
ext4_iomap_begin+0x1c0/0x348
=========================================================

Here is a calculation when adjusting ac_b_ex in ext4_mb_new_inode_pa():

	ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len);
	if (ac->ac_o_ex.fe_logical >= ex.fe_logical)
		goto adjust_bex;

The problem is that even after ac_b_ex.fe_len is subtracted from
orig_goal_end, the result is still greater than EXT_MAX_BLOCKS, which causes
ex.fe_logical to overflow to a very small value and ultimately triggers a
BUG_ON in ext4_mb_new_inode_pa() because pa->pa_free < len.

The last logical block of an actual write request never exceeds
EXT_MAX_BLOCKS, so ext4_mb_normalize_request() should likewise avoid
normalizing the last logical block beyond EXT_MAX_BLOCKS, which avoids the
above issue.

The test case in [Link] can reproduce the above issue with 64k block size.

Link: https://patchwork.kernel.org/project/fstests/list/?series=804003
Cc: <stable@kernel.org> # 6.4
Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231127063313.3734294-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7c9fa399 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: add first unit test for ext4_mb_new_blocks_simple in mballoc

Here is the preparatory work:
1. Include mballoc-test.c in mballoc.c to be able to test static functions
in mballoc.c.
2. Implement static stubs to avoid read IO to disk.
3. Construct a fake super_block. Only some members are set; more members
will be set as more functions are tested.
Then a unit test for ext4_mb_new_blocks_simple is added.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-12-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bdefd689 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: add some kunit stub for mballoc kunit test

Multi-block allocation reads and writes the block bitmap and group
descriptor, which reside on disk. Add kunit stubs for ext4_get_group_desc,
ext4_read_block_bitmap_nowait, ext4_wait_block_bitmap and
ext4_mb_mark_context to avoid real IO to disk.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-11-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5c657db4 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: call ext4_mb_mark_context in ext4_group_add_blocks()

Call ext4_mb_mark_context in ext4_group_add_blocks() to remove repeated code.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 03c7fc39 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: Separate block bitmap and buddy bitmap freeing in ext4_group_add_blocks()

This patch separates block bitmap and buddy bitmap freeing in order to
update the block bitmap with ext4_mb_mark_context in the following patch.
The reason why this can be separated is explained in a previous patch;
the explanation is repeated here to simplify the code archeology of
ext4_group_add_blocks():

Separated freeing is safe with concurrent allocation as long as:
1. Blocks are first allocated in the buddy bitmap, and then in the block bitmap.
2. Blocks are first freed in the block bitmap, and then in the buddy bitmap.
Then a freed block only becomes available for allocation once both the buddy
bitmap and the block bitmap have been updated by the freeing.
Allocation already obeys rule 1, so just do separated freeing following rule 2.

Separated freeing has no race with generate_buddy because:
Once ext4_mb_load_buddy_gfp has executed successfully, the up-to-date
buddy page can be found in sbi->s_buddy_cache and no further buddy
initialization of the buddy page will be executed concurrently until the
buddy page is unloaded. As we always free in a "load buddy, free,
unload buddy" sequence, separated freeing has no race with generate_buddy.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 38b8f70c 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: call ext4_mb_mark_context in ext4_mb_clear_bb

Call ext4_mb_mark_context in ext4_mb_clear_bb to remove repeated code.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 33e728c6 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: Separate block bitmap and buddy bitmap freeing in ext4_mb_clear_bb()

This patch separates block bitmap and buddy bitmap freeing in order to
update the block bitmap with ext4_mb_mark_context in the following patch.

Separated freeing is safe with concurrent allocation as long as:
1. Blocks are first allocated in the buddy bitmap, and then in the block bitmap.
2. Blocks are first freed in the block bitmap, and then in the buddy bitmap.
Then a freed block only becomes available for allocation once both the buddy
bitmap and the block bitmap have been updated by the freeing.
Allocation already obeys rule 1, so just do separated freeing following rule 2.

Separated freeing has no race with generate_buddy because:
Once ext4_mb_load_buddy_gfp has executed successfully, the up-to-date
buddy page can be found in sbi->s_buddy_cache and no further buddy
initialization of the buddy page will be executed concurrently until the
buddy page is unloaded. As we always free in a "load buddy, free,
unload buddy" sequence, separated freeing has no race with generate_buddy.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2f94711b 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: call ext4_mb_mark_context in ext4_mb_mark_diskspace_used

Call ext4_mb_mark_context in ext4_mb_mark_diskspace_used to:
1. Remove repeated code that updates the bitmap and group descriptor
on disk.
2. Now that we have a common API for marking blocks in-use/free in the block
bitmap, use that instead of open coding it in
ext4_mb_mark_diskspace_used(). The previous code was not updating the
checksum and other counters; ext4_mb_mark_context() should fix these
consistency problems.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c431d386 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: extend ext4_mb_mark_context to support allocation under journal

Previously, ext4_mb_mark_context was only called in the fast commit
replay path, so there was no valid handle when we updated the block bitmap
and group descriptor. This patch extends ext4_mb_mark_context so it can
be used by code running under the journal. There are several improvements:
1. Add "handle_t *handle" to struct ext4_mark_context to journal the block
bitmap and group descriptor updates inside ext4_mb_mark_context (the
added journal code is based on ext4_mb_mark_diskspace_used, where
ext4_mb_mark_context is going to be used).
2. Add a flags argument to ext4_mb_mark_context() which controls:
a. EXT4_MB_BITMAP_MARKED_CHECK - whether block bitmap checking is needed.
b. EXT4_MB_SYNC_UPDATE - whether dirty buffers (bitmap and group
descriptor) need to be synced.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 26d0f87b 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: call ext4_mb_mark_context in ext4_free_blocks_simple

Call ext4_mb_mark_context in ext4_free_blocks_simple to:
1. Remove repeated code.
2. Pair the update of free_clusters with the one in ext4_mb_new_blocks_simple.
3. Add the missing ext4_lock_group/ext4_unlock_group protection.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f9e2d95a 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: factor out codes to update block bitmap and group descriptor on disk from ext4_mb_mark_bb

There are several reasons to add a general function ext4_mb_mark_context
to update the block bitmap and group descriptor on disk:
1. Pair the behavior of allocating/freeing bits. For example,
ext4_mb_new_blocks_simple updates free_clusters in struct flex_groups
via ext4_mb_mark_bb while ext4_free_blocks_simple forgets to do this.
2. Remove repeated code that reads from disk, updates and writes back to disk.
3. Reduce the future unit test mocks needed to catch real IO that updates
structures on disk.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d2f7cf40 28-Sep-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: make state in ext4_mb_mark_bb to be bool

As state can only be 0 or 1, just make it bool.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230928160407.142069-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ebf6cb7c 24-Aug-2023 Wang Jianjian <wangjianjian0@foxmail.com>

ext4: no need to generate from free list in mballoc

Commit 7a2fcbf7f85 ("ext4: don't use blocks freed but not yet committed in
buddy cache init") added code to mark blocks on the list of freed-but-not-
yet-committed blocks as used during initialization of a buddy page. However,
ext4_mb_free_metadata() makes sure the buddy page is already loaded and takes
a reference to it, so it cannot happen that ext4_mb_init_cache() is called
when the efd list is non-empty. Just remove the
ext4_mb_generate_from_freelist() call.

Fixes: 7a2fcbf7f85 ("ext4: don't use blocks freed but not yet committed in buddy cache init")
Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com>
Link: https://lore.kernel.org/r/tencent_53CBCB1668358AE862684E453DF37B722008@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# ce774e53 12-Jun-2023 Jinke Han <hanjinke.666@bytedance.com>

ext4: make running and commit transaction have their own freed_data_list

When releasing space in jbd, we traverse s_freed_data_list to get the
free ranges belonging to the current commit transaction. In extreme cases,
the time spent may not be small, and we have observed cases exceeding
10ms. This patch makes the running and commit transactions manage their
own freed_data_list respectively, eliminating unnecessary traversal.

And in the callback phase of the commit transaction, no one will touch
it except the jbd thread itself, so s_md_lock is no longer needed.

Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20230612124017.14115-1-hanjinke.666@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5229a658 13-Sep-2023 Jan Kara <jack@suse.cz>

ext4: do not let fstrim block system suspend

Len Brown has reported that system suspend sometimes fails due to an
inability to freeze a task working in ext4_trim_fs() for one minute.
Trimming a large filesystem on a disk that slowly processes discard
requests can indeed take a long time. Since discard is just an advisory
call, it is perfectly fine to interrupt it at any time and return the
number of blocks discarded up to that moment. Do that when we detect that
the task is being frozen.

Cc: stable@kernel.org
Reported-by: Len Brown <lenb@kernel.org>
Suggested-by: Dave Chinner <david@fromorbit.com>
References: https://bugzilla.kernel.org/show_bug.cgi?id=216322
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230913150504.9054-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 45e4ab32 13-Sep-2023 Jan Kara <jack@suse.cz>

ext4: move setting of trimmed bit into ext4_try_to_trim_range()

Currently we set the group's trimmed bit in ext4_trim_all_free() based
on the return value of ext4_try_to_trim_range(). However, when we want
to abort trimming because of a suspend attempt, we want to return success
from ext4_try_to_trim_range() but not set the trimmed bit. Instead of
implementing awkward propagation of this information, just move the
setting of the trimmed bit into ext4_try_to_trim_range() when the whole
group is trimmed.

Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230913150504.9054-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 0f6bc579 12-Aug-2023 Ruan Jinjie <ruanjinjie@huawei.com>

ext4: use LIST_HEAD() to initialize the list_head in mballoc.c

Use LIST_HEAD() to initialize the list_head instead of open-coding it.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/20230812071839.3481909-1-ruanjinjie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a50bda14 24-Jul-2023 Su Hui <suhui@nfschina.com>

ext4: mballoc: avoid garbage value from err

clang's static analysis warns: fs/ext4/mballoc.c,
line 4178, column 6: branch condition evaluates to a garbage value.

err is uninitialized and will be tested when 'len <= 0' or when the loop
is first entered while the condition "!ext4_sb_block_valid()" is true.
Although this can't cause problems now, it's better to correct it.

Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20230725043310.1227621-1-suhui@nfschina.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 79ebf48c 07-Jul-2023 Lu Hongfei <luhongfei@vivo.com>

ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple()

Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Link: https://lore.kernel.org/r/20230707115907.26637-1-luhongfei@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 89cadf6e 07-Jul-2023 Lu Hongfei <luhongfei@vivo.com>

ext4: change the type of blocksize in ext4_mb_init_cache()

The return value type of i_blocksize() is 'unsigned int', so the
type of blocksize has been modified from 'int' to 'unsigned int'
to ensure data type consistency.

Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Link: https://lore.kernel.org/r/20230707105516.9156-1-luhongfei@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 772c9f691 16-Jul-2023 Ritesh Harjani <ritesh.list@gmail.com>

ext4: don't use CR_BEST_AVAIL_LEN for non-regular files

Using CR_BEST_AVAIL_LEN only makes sense for regular files, as for
non-regular files we never normalize the allocation request length, i.e. the
goal length is the same as the original length (ac_g_ex.fe_len == ac_o_ex.fe_len).

Hence there is no scope for trimming the goal length to make it satisfy
the original request length. Thus this patch avoids using the
CR_BEST_AVAIL_LEN criteria for non-regular file requests.

Cc: stable@kernel.org
Fixes: 33122aa930f1 ("ext4: Add allocation criteria 1.5 (CR1_5)")
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/2a694c748ff8b8c4b416995a24f06f07b55047a8.1689516047.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4eea9fbe 01-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: correct some stale comment of criteria

We have renamed the criteria to CR_XXX names; correct stale comments that
still refer to criteria by raw numbers.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-11-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bcb123ac 01-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: return found group directly in ext4_mb_choose_next_group_best_avail

Return the good group as soon as it's found in the loop, removing the
further check after the loop for whether a good group was found.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b50675a4 01-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: return found group directly in ext4_mb_choose_next_group_goal_fast

Return the good group as soon as it's found in the loop, removing the
further check after the loop for whether a good group was found.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# de8bf0e5 01-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: replace the traditional ternary conditional operator with with max()/min()

Replace the traditional ternary conditional operator with max()/min().

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ad635507 01-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unnecessary return for void function

The return at the end of a void function is unnecessary; just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bb60caa2 01-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: use is_power_of_2 helper in ext4_mb_regular_allocator

Use intuitive is_power_of_2 helper in ext4_mb_regular_allocator.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 919eb90c 01-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: return found group directly in ext4_mb_choose_next_group_p2_aligned

Return the good group as soon as it's found in the loop, removing the
unnecessary NULL initialization of grp and the further check after the loop
for whether a good group was found.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 60c672b7 01-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: avoid potential data overflow in next_linear_group

ngroups is ext4_group_t (unsigned int) while next_linear_group treats it
as int. If ngroups is bigger than the maximum number representable by int,
it will be treated as a negative number. Then "return group + 1 >= ngroups ?
0 : group + 1;" may keep returning 0.
Switch from int to ext4_group_t in next_linear_group to fix the overflow.
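
A small userspace demonstration of the described signed/unsigned mix-up
(illustrative only, not the kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int ngroups = 0x80000000U;	/* larger than INT_MAX */
	int group = 5;

	/* Treated as int, ngroups becomes negative, so the comparison
	 * "group + 1 >= ngroups" is always true and 0 is returned. */
	printf("ngroups as int: %d\n", (int)ngroups);
	printf("next group: %d\n", group + 1 >= (int)ngroups ? 0 : group + 1);
	return 0;
}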

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a9ce5993 01-Aug-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: correct grp validation in ext4_mb_good_group

The group corruption check accesses memory through grp and will trigger a
kernel crash if grp is NULL. So do the NULL check before the corruption check.

Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 304749c0 30-Jun-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: replace CR_FAST macro with inline function for readability

Replace CR_FAST with ext4_mb_cr_expensive() inline function for better
readability. This function returns true if the criteria is one of the
expensive/slower ones where lots of disk IO/prefetching is acceptable.

No functional changes are intended in this patch.
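
A sketch of what such a helper looks like, assuming the renamed CR_XXX
criteria from the earlier series; the exact threshold shown here is an
assumption based on the description above, not a quote of the kernel source:

static inline bool ext4_mb_cr_expensive(enum criteria cr)
{
	/* "expensive" criteria are the slower ones that allow lots of IO */
	return cr >= CR_GOAL_LEN_SLOW;
}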

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230630085927.140137-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 95257987 16-Jun-2023 Jan Kara <jack@suse.cz>

ext4: drop EXT4_MF_FS_ABORTED flag

The EXT4_MF_FS_ABORTED flag has practically the same intent as the
EXT4_FLAGS_SHUTDOWN flag. The shutdown flag is checked in many more
places than the aborted flag, which is mostly a historical artifact of
relying on SB_RDONLY checks instead of the aborted flag checks. There
are only three places - ext4_sync_file(), __ext4_remount(), and the
mballoc debug code - which check the aborted flag and not the shutdown
flag, and this is arguably a bug. Avoid these inconsistencies by
removing the EXT4_MF_FS_ABORTED flag and using EXT4_FLAGS_SHUTDOWN
everywhere.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616165109.21695-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bedc5d34 24-Jul-2023 Baokun Li <libaokun1@huawei.com>

ext4: avoid overlapping preallocations due to overflow

Let's say we want to allocate 2 blocks starting from 4294966386, after
predicting the file size, start is aligned to 4294965248, len is changed
to 2048, then end = start + size = 0x100000000. Since end is of
type ext4_lblk_t, i.e. uint, end is truncated to 0.

This causes (pa->pa_lstart >= end) to always hold when checking if the
current extent to be allocated crosses already preallocated blocks, so the
resulting ac_g_ex may cross already preallocated blocks. Hence we convert
the end type to loff_t and use pa_logical_end() to avoid overflow.
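
A tiny userspace demo of the truncation described above, using the same
numbers (loff_t stood in by long long; illustrative only):

#include <stdio.h>

typedef unsigned int ext4_lblk_t;	/* 32-bit logical block number */

int main(void)
{
	ext4_lblk_t start = 4294965248U;	/* aligned start from the example */
	ext4_lblk_t len = 2048;			/* normalized length */

	ext4_lblk_t end = start + len;		/* 0x100000000 wraps to 0 */
	long long wide_end = (long long)start + len;	/* stays 0x100000000 */

	printf("truncated end = %u, widened end = %lld\n", end, wide_end);
	return 0;
}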

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230724121059.11834-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bc056e71 24-Jul-2023 Baokun Li <libaokun1@huawei.com>

ext4: fix BUG in ext4_mb_new_inode_pa() due to overflow

When we calculate the end position of ext4_free_extent, this position may
be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if
ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the
computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not
the first case of adjusting the best extent, that is, new_bex_end > 0, the
following BUG_ON will be triggered:

=========================================================
kernel BUG at fs/ext4/mballoc.c:5116!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279
RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430
Call Trace:
<TASK>
ext4_mb_use_best_found+0x203/0x2f0
ext4_mb_try_best_found+0x163/0x240
ext4_mb_regular_allocator+0x158/0x1550
ext4_mb_new_blocks+0x86a/0xe10
ext4_ext_map_blocks+0xb0c/0x13a0
ext4_map_blocks+0x2cd/0x8f0
ext4_iomap_begin+0x27b/0x400
iomap_iter+0x222/0x3d0
__iomap_dio_rw+0x243/0xcb0
iomap_dio_rw+0x16/0x80
=========================================================

A simple reproducer demonstrating the problem:

mkfs.ext4 -F /dev/sda -b 4096 100M
mount /dev/sda /tmp/test
fallocate -l1M /tmp/test/tmp
fallocate -l10M /tmp/test/file
fallocate -i -o 1M -l16777203M /tmp/test/file
fsstress -d /tmp/test -l 0 -n 100000 -p 8 &
sleep 10 && killall -9 fsstress
rm -f /tmp/test/tmp
xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192"

We simply refactor the logic for adjusting the best extent by adding a
temporary ext4_free_extent ex and using extent_logical_end() to avoid
overflow, which also simplifies the code.

Cc: stable@kernel.org # 6.4
Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230724121059.11834-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 43bbddc0 24-Jul-2023 Baokun Li <libaokun1@huawei.com>

ext4: add two helper functions extent_logical_end() and pa_logical_end()

When we use lstart + len to calculate the end of free extent or prealloc
space, it may exceed the maximum value of 4294967295(0xffffffff) supported
by ext4_lblk_t and cause overflow, which may lead to various problems.

Therefore, we add two helper functions, extent_logical_end() and
pa_logical_end(), to limit the type of end to loff_t, and also convert
lstart to loff_t for calculation to avoid overflow.
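
A sketch of the likely shape of these helpers, based on the description
above (an approximation, not guaranteed to match the kernel source exactly):

static inline loff_t extent_logical_end(struct ext4_sb_info *sbi,
					struct ext4_free_extent *fex)
{
	/* compute in loff_t so lstart + len cannot wrap ext4_lblk_t */
	return (loff_t)fex->fe_logical + EXT4_C2B(sbi, fex->fe_len);
}

static inline loff_t pa_logical_end(struct ext4_sb_info *sbi,
				    struct ext4_prealloc_space *pa)
{
	return (loff_t)pa->pa_lstart + EXT4_C2B(sbi, pa->pa_len);
}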

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230724121059.11834-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9d3de7ee 22-Jul-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: fix rbtree traversal bug in ext4_mb_use_preallocated

During allocations, while looking for preallocations (PAs) in the per-inode
rbtree, we can't do a direct traversal of the tree because
ext4_mb_discard_group_preallocation() can mark a PA deleted in parallel,
which can cause the direct traversal to skip some entries. This was
leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy
our request and ultimately tried to create a new PA that would overlap
with the missed one.

To make sure we handle that case while still keeping the performance of
the rbtree, we make use of the fact that the only PA that could possibly
overlap the original goal start is the one that satisfies the conditions
below:

1. It must have its logical start immediately to the left of (i.e. less
than) the original logical start.

2. It must not be deleted.

To find this pa we use the following traversal method:

1. Descend into the rbtree normally to find the immediately neighboring
PA. Here we keep descending irrespective of whether the PA is deleted or
whether it overlaps with our request etc. The goal is to find an
immediately adjacent PA.

2. If the found PA is to the right of the original goal, use rb_prev() to
find the left adjacent PA.

3. Check if this PA is deleted and keep moving left with rb_prev() until
a non-deleted PA is found.

4. This is the PA we are looking for. Now we can check if it can satisfy
the original request and proceed accordingly.

This approach also takes care of having deleted PAs in the tree.

(While we are at it, also fix a possible overflow bug in calculating the
end of a PA)

[1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/

Cc: stable@kernel.org # 6.4
Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5d5460fa 09-Jun-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail()

In ext4_mb_choose_next_group_best_avail(), we want the starting order to be
1 order below the goal length, and the min_order to be, at most, 1 order
above the original length. This commit fixes an off-by-one issue that arose
from the fact that 1 << fls(n) > n (a small demo follows the summary below).

After all the processing:

order = 1 order below goal len
min_order = maximum of the three:-
- order - trim_order
- 1 order below B2C(s_stripe)
- 1 order above original len
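
A userspace illustration of why 1 << fls(n) overshoots n (fls_demo is a
stand-in for the kernel's fls(); illustrative only):

#include <stdio.h>

/* Stand-in for the kernel's fls(): 1-based index of the highest set bit. */
static int fls_demo(unsigned int n)
{
	int pos = 0;

	while (n) {
		pos++;
		n >>= 1;
	}
	return pos;
}

int main(void)
{
	unsigned int goal_len = 8;

	/* fls(8) = 4, so 1 << fls(8) = 16 > 8: one order above the goal,
	 * hence the off-by-one adjustments described above. */
	printf("fls(%u) = %d, 1 << fls = %d\n",
	       goal_len, fls_demo(goal_len), 1 << fls_demo(goal_len));
	return 0;
}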

Cc: stable@kernel.org
Fixes: 33122aa930 ("ext4: Add allocation criteria 1.5 (CR1_5)")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230609103403.112807-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4c0cfebd 08-Jun-2023 Theodore Ts'o <tytso@mit.edu>

ext4: clean up mballoc criteria comments

Line wrap and slightly clarify the comments describing mballoc's
criteria.

Define EXT4_MB_NUM_CRS as part of the enum, so that it automatically
gets updated when criteria are added or removed.

Also fix a potential uninitialized use of the 'cr' variable if
CONFIG_EXT4_DEBUG is enabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f52f3d2b 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Give symbolic names to mballoc criterias

mballoc criteria have historically been referred to by numbers
like CR0, CR1..., however this makes it confusing to understand
what each criterion is about.

Change these criteria from numbers to symbolic names and add
relevant comments. While we are at it, also reformat and add some
comments to ext4_seq_mb_stats_show() for better readability.

Additionally, define CR_FAST which signifies the criteria
below which we can make quicker decisions like:
* quitting early if (free block < requested len)
* avoiding to scan free extents smaller than required len.
* avoiding to initialize buddy cache and work with existing cache
* limiting prefetches

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/a2dc6ec5aea5e5e68cf8e788c2a964ffead9c8b0.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7e170922 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Add allocation criteria 1.5 (CR1_5)

CR1_5 aims to optimize allocations which can't be satisfied in CR1. The
fact that we couldn't find a group in CR1 suggests that it would be
difficult to find a continuous extent to completely satisfy our
allocation. So before falling back to the slower CR2, in CR1.5 we
proactively trim the preallocations so we can find a group with
(free / fragments) big enough. This speeds up our allocation at the
cost of slightly reduced preallocation.

The patch also adds a new sysfs tunable:

* /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order

This controls how much CR1.5 can trim a request before falling back to CR2.
For example, for a request of order 7 and a max trim order of 2, CR1.5 can
trim it down as far as order 5.

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/150fdf65c8e4cc4dba71e020ce0859bcf636a5ff.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 856d865c 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Abstract out logic to search average fragment list

Make the logic for searching the average fragment list of a given order
reusable by abstracting it out into a different function. This will also
avoid code duplication in upcoming patches.

No functional changes.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/028c11d95b17ce0285f45456709a0ca922df1b83.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4f3d1e45 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs

Before this patch, the call stack in ext4_run_li_request is as follows:

/*
 * nr = no. of BGs we want to fetch (=s_mb_prefetch)
 * prefetch_ios = no. of BGs not uptodate after
 *                ext4_read_block_bitmap_nowait()
 */
next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios);
ext4_mb_prefetch_fini(sb, next_group, prefetch_ios);

ext4_mb_prefetch_fini() will only try to initialize buddies for BGs in
range [next_group - prefetch_ios, next_group). This is incorrect since
sometimes (prefetch_ios < nr), which causes ext4_mb_prefetch_fini() to
incorrectly ignore some of the BGs that might need initialization. This
issue is more notable now with the previous patch enabling "fetching" of
BLOCK_UNINIT BGs which are marked buffer_uptodate by default.

Fix this by passing nr to ext4_mb_prefetch_fini() instead of
prefetch_ios so that it considers the right range of groups.

Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in
ext4_mb_regular_allocator() since we might have prefetched BLOCK_UNINIT
groups that would need buddy initialization.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/05e648ae04ec5b754207032823e9c1de9a54f87a.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3c629604 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Don't skip prefetching BLOCK_UNINIT groups

Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
As a consequence, we end up not initializing the buddy structures and CR0/1
lists for these BGs, even though this can be done without any disk IO
overhead. Hence, don't skip such BGs during prefetch and prefetch_fini.

This improves the accuracy of CR0/1 allocation as earlier, we could have
essentially empty BLOCK_UNINIT groups being ignored by CR0/1 due to their buddy
not being initialized, leading to slower CR2 allocations. With this patch CR0/1
will be able to discover these groups as well, thus improving performance.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/dc3130b8daf45ffe63d8a3c1edcf00eb8ba70e1f.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1b420011 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Avoid scanning smaller extents in BG during CR1

When we are inside ext4_mb_complex_scan_group() in CR1, we can be sure
that this group has at least one continuous free extent big enough to
satisfy our request, because (free / fragments) > goal length.

Hence, instead of wasting time looping over smaller free extents, only
consider a free extent if we are sure that it has enough
continuous free space to satisfy the goal length. This is particularly
useful when scanning highly fragmented BGs in CR1 as, without this
patch, the allocator might stop scanning early before reaching the big
enough free extent (due to ac_found > mb_max_to_scan), which causes us to
unnecessarily trim the request.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a5473df4517c53ec940bc9b603ef83a547032a32.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3ef5d263 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Add counter to track successful allocation of goal length

Track the number of allocations where the length of blocks allocated is
equal to the length of the goal blocks (post normalization). This metric
could be useful when making changes to the allocator logic in the future,
as it gives us visibility into how often we trim our requests.

PS: ac_b_ex.fe_len might get modified due to preallocation efforts, and
hence we use ac_f_ex.fe_len instead, since we want to compare how much the
allocator was actually able to find.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/343620e2be8a237239ea2613a7a866ee8607e973.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# fdd9a009 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Add per CR extent scanned counter

This gives better visibility into the number of extents scanned in each
particular CR. For example, this information can be used to see how our
block group scanning logic is performing when the BG is fragmented.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/55bb6d80f6e22ed2a5a830aa045572bdffc8b1b9.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4eb7a4a1 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Convert mballoc cr (criteria) to enum

Convert criteria to be an enum so it is easier to maintain, and
update the tracefiles to use the enum names. This change also makes
it easier to insert new criteria in the future.

There is no functional change in this patch.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/5d82fd467bdf70ea45bdaef810af3b146013946c.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 569f196f 30-May-2023 Ritesh Harjani <ritesh.list@gmail.com>

ext4: mballoc: Remove useless setting of ac_criteria

There will be changes coming in future patches which will introduce a new
criteria for block allocation. This patch removes the useless setting of
ac_criteria. AFAIU, it might only have been used to differentiate between
whether preallocated blocks were used or the regular allocator was called to
allocate blocks. Hence this also adds debug prints to ext4_mb_show_ac() to
identify what type of block allocation was done.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1dbae05617519cb6202f1b299c9d1be3e7cda763.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2ec6d0a5 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: fix wrong unit use in ext4_mb_new_blocks

Function ext4_free_blocks_simple needs the count in clusters, while function
ext4_free_blocks accepts the count in blocks. Convert the count to clusters
to fix the mismatch.
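
For reference, the block-to-cluster conversion on a bigalloc filesystem is
just a rounded-up shift; a standalone sketch (cluster_bits here stands in for
the superblock's cluster size shift, this is not the kernel helper itself):

    /* 1 cluster == 2^cluster_bits blocks */
    unsigned long long blocks_to_clusters(unsigned long long nblocks,
                                          unsigned int cluster_bits)
    {
            unsigned long long per_cluster = 1ULL << cluster_bits;

            /* round up: a partial cluster still occupies a whole cluster */
            return (nblocks + per_cluster - 1) >> cluster_bits;
    }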

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: stable@kernel.org
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-12-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 247c3d21 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: fix wrong unit use in ext4_mb_clear_bb

Function ext4_issue_discard needs the count in clusters. Pass count_clusters
instead of count to fix the mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: stable@kernel.org
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-11-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ad78b5ef 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unused parameter from ext4_mb_new_blocks_simple()

Two cleanups for ext4_mb_new_blocks_simple:
Remove unused parameter handle of ext4_mb_new_blocks_simple.
Move ext4_mb_new_blocks_simple definition before ext4_mb_new_blocks to
remove unnecessary forward declaration of ext4_mb_new_blocks_simple.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 11b6890b 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: get block from bh in ext4_free_blocks for fast commit replay

ext4_free_blocks will retrieve the block from bh if the block parameter is
zero. Retrieve the block before ext4_free_blocks_simple to avoid potentially
passing a wrong block to ext4_free_blocks_simple.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: stable@kernel.org
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 19a043bb 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: try all groups in ext4_mb_new_blocks_simple

ext4_mb_new_blocks_simple ignores the groups before the goal group, so it
will fail if the free blocks reside in a group before the goal. Try all
groups to avoid this unexpected failure.
The search finishes either when a free block is found or when no available
blocks are left. Simply check "i >= max" to distinguish the two cases.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1eff5904 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: add EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated

ext4_mb_use_preallocated ignores the demand to allocate goal blocks even
though EXT4_MB_HINT_GOAL_ONLY is requested.
For group PAs, ext4_mb_group_or_file will not set EXT4_MB_HINT_GROUP_ALLOC
if EXT4_MB_HINT_GOAL_ONLY is set, so we will not allocate goal blocks from a
group PA if EXT4_MB_HINT_GOAL_ONLY is set.
For inode PAs, ext4_mb_pa_goal_check is added to check whether the free extent
in the found inode PA meets the goal blocks when EXT4_MB_HINT_GOAL_ONLY is set.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c3defd99 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: treat stripe in block unit

Stripe is used inconsistently in block units and in cluster units in
different code paths. A user who is aware of stripe may not be aware of the
bigalloc feature, so treat stripe only in block units to fix this.
Besides, it is hard to get stripe-aligned blocks (start and length both
aligned with stripe) if the stripe is not aligned with the cluster size, so
just disable stripe and alert the user in this case to simplify the code and
avoid the unnecessary work of trying to get stripe-aligned blocks, which
would likely fail.
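
A minimal sketch of the resulting policy (illustrative only; sanitize_stripe()
is a made-up name and cluster_ratio stands in for blocks per cluster):

    #include <stdio.h>

    unsigned long sanitize_stripe(unsigned long stripe_blocks,
                                  unsigned int cluster_ratio)
    {
            /* stripe not aligned to the cluster size: disable it and warn */
            if (stripe_blocks % cluster_ratio) {
                    fprintf(stderr,
                            "stripe %lu not aligned to cluster size %u, disabling\n",
                            stripe_blocks, cluster_ratio);
                    return 0;
            }
            return stripe_blocks;       /* keep stripe, in block units */
    }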

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 99c515e3 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: fix wrong unit use in ext4_mb_find_by_goal

We need start in block units while fe_start is in cluster units. Use the
ext4_grp_offs_to_block helper to convert fe_start and get start in
block units.
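
For illustration, the conversion amounts to the following arithmetic
(standalone sketch, not the kernel helper; first_data_block, blocks_per_group
and cluster_bits stand in for the corresponding superblock fields):

    /* (group, cluster offset in group) -> absolute block number */
    unsigned long long grp_offs_to_block(unsigned int group,
                                         unsigned int cluster_off,
                                         unsigned long long first_data_block,
                                         unsigned int blocks_per_group,
                                         unsigned int cluster_bits)
    {
            unsigned long long group_first = first_data_block +
                    (unsigned long long)group * blocks_per_group;

            /* the offset is in clusters, so scale it to blocks */
            return group_first +
                    ((unsigned long long)cluster_off << cluster_bits);
    }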

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 497885f7 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: fix unit mismatch in ext4_mb_new_blocks_simple

The "i" returned from mb_find_next_zero_bit is in cluster unit and we
need offset "block" corresponding to "i" in block unit. Convert "i" to
block unit to fix the unit mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b3916da0 03-Jun-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: fix wrong unit use in ext4_mb_normalize_request

NRL_CHECK_SIZE will compare the input req and size, so req and size should
be in the same unit. The input req "fe_len" is in cluster units while the
input size "(8<<20)>>bsbits" is in block units. Convert "fe_len" to block
units to fix the mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3582e745 30-May-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"

This reverts commit 32c0869370194ae5ac9f9f501953ef693040f6a1.

The reverted commit was intended to remove a dead check; however, it was
observed that this check was actually being used to exit early instead of
looping sbi->s_mb_max_to_scan times when we are able to find a free extent
bigger than the goal extent. Due to this, my performance tests (fsmark,
parallel file writes in a highly fragmented FS) were seeing a 2x-3x regression.

For example, the default values of the following variables are:

sbi->s_mb_max_to_scan = 200
sbi->s_mb_min_to_scan = 10

In ext4_mb_check_limits(), if we find an extent smaller than the goal, we
return early and try again. This loop goes on until we have processed
sbi->s_mb_max_to_scan (=200) free extents, at which point we exit and just
use whatever we have, even if it is smaller than the goal extent.

Now, the regression comes when we find an extent bigger than the goal.
Earlier, in this case we would loop only sbi->s_mb_min_to_scan (=10) times and
then just use the bigger extent. However with commit 32c08693 that check was
removed, and hence we would loop sbi->s_mb_max_to_scan (=200) times even
though we have a big enough free extent to satisfy the request. The only time
we would exit early would be when the free extent is *exactly* the size of
our goal, which is a pretty uncommon occurrence, so we would almost always
end up looping 200 times.

Hence, revert the commit by adding the check back to fix the regression. Also
add a comment to outline this policy.
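
Roughly, the restored policy looks like the sketch below (simplified,
standalone C, not the actual ext4_mb_check_limits() code):

    /*
     * Decide whether to stop scanning free extents in the current group.
     * best_len and goal_len are in the same units; found is the number of
     * extents examined so far.
     */
    int should_stop_scanning(unsigned int found, unsigned int best_len,
                             unsigned int goal_len,
                             unsigned int min_to_scan,  /* e.g. 10  */
                             unsigned int max_to_scan)  /* e.g. 200 */
    {
            if (best_len == goal_len)
                    return 1;   /* exact match, stop immediately */
            if (best_len > goal_len && found > min_to_scan)
                    return 1;   /* good enough extent found, stop early */
            if (found > max_to_scan)
                    return 1;   /* scanned enough, settle for best so far */
            return 0;           /* keep scanning */
    }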

Fixes: 32c086937019 ("ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/ddcae9658e46880dfec2fb0aa61d01fb3353d202.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 463808f2 29-Apr-2023 Theodore Ts'o <tytso@mit.edu>

ext4: remove a BUG_ON in ext4_mb_release_group_pa()

If a malicious fuzzer overwrites the ext4 superblock while it is
mounted such that the s_first_data_block is set to a very large
number, the calculation of the block group can underflow, and trigger
a BUG_ON check. Change this to be an ext4_warning so that we don't
crash the kernel.

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-3-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5354b2af 28-Apr-2023 Theodore Ts'o <tytso@mit.edu>

ext4: allow ext4_get_group_info() to fail

Previously, ext4_get_group_info() would treat an invalid group number
as BUG(), since in theory it should never happen. However, if a
malicious attacker (or fuzzer) modifies the superblock via the block
device while the file system is mounted, it is possible for
s_first_data_block to get set to a very large number. In that case,
calculating the block group of some block number (such as the
starting block of a preallocation region) could result in an
underflow and a very large block group number. Then the BUG_ON check in
ext4_get_group_info() would fire, resulting in a denial of service
attack that can be triggered by root or someone with write access to
the block device.

From a quality of implementation perspective, it's best that even if
the system administrator does something that they shouldn't, it
will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info()
will call ext4_error and return NULL. We also add fallback code in
all of the callers of ext4_get_group_info() to handle the case where it
might return NULL.

Also, since ext4_get_group_info() was already borderline too big to be an
inline function, un-inline it. This results in a net reduction of the
compiled text size of ext4 by roughly 2k.

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 361eb69f 25-Mar-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Remove the logic to trim inode PAs

Earlier, inode PAs were stored in a linked list. This caused a need to
periodically trim the list down in order to avoid growing it to a very
large size, as this would severely affect performance during list
iteration.

Recent patches changed this list to an rbtree, and since the tree scales
up much better, we no longer need to have the trim functionality, hence
remove it.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/c409addceaa3ade4b40328e28e3b54b2f259689e.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 38727786 25-Mar-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list

Currently, the kernel uses i_prealloc_list to hold all the inode
preallocations. This is known to cause degradation in performance in
workloads which perform a large number of sparse writes on a single file.
This is mainly because functions like ext4_mb_normalize_request() and
ext4_mb_use_preallocated() iterate over this complete list, resulting in
slowdowns when large number of PAs are present.

Patch 27bc446e2 partially fixed this by enforcing a limit of 512 for
the inode preallocation list and adding logic to continually trim the
list if it grows above the threshold, however our testing revealed that
a hardcoded value is not suitable for all kinds of workloads.

To optimize this, add an rbtree to the inode and hold the inode
preallocations in this rbtree. This will make iterating over inode PAs
faster and scale much better than a linked list. Additionally, we also
had to remove the LRU logic that was added during trimming of the list
(in ext4_mb_release_context()) as it would add extra overhead with the
rbtree. Discards now happen in lowest-logical-offset-first order.

** Locking notes **

With the introduction of the rbtree to maintain inode PAs, we can't use RCU
to walk the tree for searching, since it can result in partial traversals
which might miss some nodes (or entire subtrees) while discards happen
in parallel (which happens under a lock). Hence this patch converts the
ei->i_prealloc_lock spin_lock to a rw_lock.

Almost all the codepaths that read/modify the PA rbtrees are protected
by the higher level inode->i_data_sem (except
ext4_mb_discard_group_preallocations() and ext4_clear_inode()). IIUC, the
only place we need lock protection is when one thread is reading
("searching") the PA rbtree (earlier protected under rcu_read_lock()) and
another is "deleting" PAs in the ext4_mb_discard_group_preallocations()
function (which iterates over all the PAs using grp->bb_prealloc_list and
deletes PAs from the tree without taking any inode lock (i_data_sem)).

So, this patch converts all rcu_read_lock/unlock() paths for inode list
PA to use read_lock() and all places where we were using
ei->i_prealloc_lock spinlock will now be using write_lock().

Note that this makes the fast path (searching for the right PA, e.g. in
ext4_mb_use_preallocated() or ext4_mb_normalize_request()) now use
read_lock() instead of rcu_read_lock/unlock(). This means it can now block
on the slow discard path (ext4_mb_discard_group_preallocations()), which
uses write_lock().

But this is not as bad as it looks. This is because -

1. The slow path only occurs when the normal allocation failed and we
can say that we are low on disk space. One can argue this scenario
won't be very frequent.

2. ext4_mb_discard_group_preallocations() locks and unlocks the rwlock
for deleting every individual PA. This gives enough opportunity for
the fast path to acquire the read_lock for searching the PA inode
list.
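
A toy userspace sketch of this locking pattern (illustrative only: it uses a
pthread rwlock and a plain list where the kernel uses ei->i_prealloc_lock and
an rbtree; all names below are made up):

    #include <pthread.h>
    #include <stdlib.h>

    struct pa {
            unsigned long lstart;
            struct pa *next;
    };

    struct inode_pas {
            pthread_rwlock_t lock;  /* replaces the old spinlock + RCU */
            struct pa *head;
    };

    /* fast path: search under the read lock (was rcu_read_lock() before) */
    struct pa *pa_lookup(struct inode_pas *ip, unsigned long lstart)
    {
            struct pa *p;

            pthread_rwlock_rdlock(&ip->lock);
            for (p = ip->head; p; p = p->next)
                    if (p->lstart == lstart)
                            break;
            pthread_rwlock_unlock(&ip->lock);
            return p;
    }

    /* slow path: discard PAs one at a time, retaking the write lock for
     * each deletion so that concurrent readers can slip in between */
    void pa_discard_all(struct inode_pas *ip)
    {
            for (;;) {
                    struct pa *victim;

                    pthread_rwlock_wrlock(&ip->lock);
                    victim = ip->head;
                    if (victim)
                            ip->head = victim->next;
                    pthread_rwlock_unlock(&ip->lock);

                    if (!victim)
                            break;
                    free(victim);
            }
    }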

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/4137bce8f6948fedd8bae134dabae24acfe699c6.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a8e38fd3 25-Mar-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union

** Splitting pa->pa_inode_list **

Currently, we use the same pa->pa_inode_list to add a pa to either
the inode preallocation list or the locality group preallocation list.
For better clarity, split this list into a union of 2 list_heads and use
either of them based on the type of pa.

** Splitting pa->pa_obj_lock **

Currently, pa->pa_obj_lock is either assigned &ei->i_prealloc_lock for
inode PAs or lg_prealloc_lock for lg PAs, and is then used to lock the
lists containing these PAs. Make the distinction between the 2 PA types
clear by changing this lock to a union of 2 locks. Explicitly use the
pa_lock_node.inode_lock for inode PAs and pa_lock_node.lg_lock for lg
PAs.

This patch is required so that the locality group preallocation code
remains the same as in upcoming patches we are going to make changes to
inode preallocation code to move from list to rbtree based
implementation. This patch also makes it easier to review the upcoming
patches.

There are no functional changes in this patch.

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1d7ac0557e998c3fc7eef422b52e4bc67bdef2b0.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 93cdf49f 25-Mar-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()

When the length of the best extent found is less than the length of the goal
extent, we need to make sure that the best extent at least covers the start
of the original request. This is done by adjusting ac_b_ex.fe_logical (the
logical start) of the extent.

While doing so, the current logic sometimes results in the best extent's
logical range overflowing the goal extent. Since this best extent is later
added to the inode preallocation list, we have a possibility of introducing
overlapping preallocations. This is discussed in detail here [1].

As per Jan's suggestion, to fix this, replace the existing logic with the
below logic for adjusting best extent as it keeps fragmentation in check
while ensuring logical range of best extent doesn't overflow out of goal
extent:

1. Check if best extent can be kept at end of goal range and still cover
original start.
2. Else, check if best extent can be kept at start of goal range and still
cover original start.
3. Else, keep the best extent at start of original request.

Also, add a few extra BUG_ONs that might help catch errors faster.

[1] https://lore.kernel.org/r/Y+OGkVvzPN0RMv0O@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com
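
A simplified standalone sketch of the three placement steps above (not the
kernel code; all values are logical block numbers/lengths, and the caller
guarantees best_len is no larger than the goal range):

    unsigned long place_best_extent(unsigned long goal_start,
                                    unsigned long goal_end,
                                    unsigned long orig_start,
                                    unsigned long best_len)
    {
            /* 1. keep the best extent at the end of the goal range if it
             *    still covers the original start */
            if (goal_end - best_len >= goal_start &&
                goal_end - best_len <= orig_start)
                    return goal_end - best_len;

            /* 2. else keep it at the start of the goal range if that still
             *    covers the original start */
            if (goal_start <= orig_start &&
                goal_start + best_len > orig_start)
                    return goal_start;

            /* 3. else fall back to the start of the original request */
            return orig_start;
    }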

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/f96aca6d415b36d1f90db86c1a8cd7e2e9d7ab0e.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 0830344c 25-Mar-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Abstract out overlap fix/check logic in ext4_mb_normalize_request()

Abstract out the logic of fixing PA overlaps in ext4_mb_normalize_request
to improve readability of code. This also makes it easier to make changes
to the overlap logic in future.

There are no functional changes in this patch

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/9b35f3955a1d7b66bbd713eca1e63026e01f78c1.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7692094a 25-Mar-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Move overlap assert logic into a separate function

Abstract out the logic to double check for overlaps in normalize_pa into
a separate function. Since there have been no reports in the past where we
have seen any overlaps that hit this bug_on(), in the future we can
consider calling this function under "#ifdef AGGRESSIVE_CHECK" only.

There are no functional changes in this patch

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/35dd5d94fa0b2d1cd2d2947adf8967279c72967d.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bcf43499 25-Mar-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Refactor code in ext4_mb_normalize_request() and ext4_mb_use_preallocated()

Change some variable names to be more consistent and
refactor some of the code to make it easier to read.

There are no functional changes in this patch

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/8edcab489c06cf861b19d87207d9b0ff7ac7f3c1.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 82089725 25-Mar-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Refactor code related to freeing PAs

This patch makes the following changes:

* Rename ext4_mb_pa_free to ext4_mb_pa_put_free
to better reflect its purpose

* Add new ext4_mb_pa_free() which only handles freeing

* Refactor ext4_mb_pa_callback() to use ext4_mb_pa_free()

There are no functional changes in this patch

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/b273bc9cbf5bd278f641fa5bc6c0cc9e6cb3330c.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e86a7182 25-Mar-2023 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Stop searching if PA doesn't satisfy non-extent file

If we come across a PA that matches the logical offset but is unable to
satisfy a non-extent file due to its physical start being higher than
that supported by non extent files, then simply stop searching for
another PA and break out of the loop. This is because, since PAs don't
overlap, we won't be able to find another inode PA which can satisfy the
original request.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/42404ca29bd304ae2c962184c3c32a02e8eefcd0.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 19b8b035 16-Mar-2023 Theodore Ts'o <tytso@mit.edu>

ext4: convert some BUG_ON's in mballoc to use WARN_RATELIMITED instead

In cases where we have an obvious way of continuing, let's use
WARN_RATELIMITED() instead of BUG_ON().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 91a48aaf 11-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: avoid unnecessary pointer dereference in ext4_mb_normalize_request

Result of EXT4_SB(ac->ac_sb) is already stored in sbi at beginning of
ext4_mb_normalize_request. Use sbi instead of EXT4_SB(ac->ac_sb) to
remove unnecessary pointer dereference.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230311170949.1047958-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1221b235 11-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: fix typos in mballoc

pa_plen -> pa_len
pa_start -> pa_pstart

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230311170949.1047958-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 253cacb0 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: simplify calculation of blkoff in ext4_mb_new_blocks_simple

We try to allocate a block from the goal in ext4_mb_new_blocks_simple. We
only need to get blkoff in the first group (the goal group) and set blkoff
to 0 for the rest of the groups.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-21-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 46825e94 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove comment code ext4_discard_preallocations

Just remove the commented-out code in ext4_discard_preallocations.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-20-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3a037b1b 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove repeat assignment to ac_f_ex

Call trace to assign ac_f_ex:
ext4_mb_use_best_found
ac->ac_f_ex = ac->ac_b_ex;
ext4_mb_new_preallocation
ext4_mb_new_group_pa
ac->ac_f_ex = ac->ac_b_ex;
ext4_mb_new_inode_pa
ac->ac_f_ex = ac->ac_b_ex;

The actually allocated blocks are already stored in ac_f_ex in
ext4_mb_use_best_found, so there is no need to assign ac_f_ex
in ext4_mb_new_group_pa and ext4_mb_new_inode_pa.
Just remove the repeated assignment to ac_f_ex in ext4_mb_new_group_pa
and ext4_mb_new_inode_pa.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-19-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# fb28f9ce 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unnecessary goto in ext4_mb_mark_diskspace_used

When ext4_read_block_bitmap fails, we can return PTR_ERR(bitmap_bh) to
remove unnecessary NULL check of bitmap_bh.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-18-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c7f2bafa 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unnecessary count2 in ext4_free_data_in_buddy

count2 is always 1 in mb_debug, so just remove the unnecessary count2.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-17-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# df119095 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unnecessary exit_meta_group_info tag

We goto exit_meta_group_info only to return -ENOMEM. Return -ENOMEM
directly instead of using goto, and remove this unnecessary tag.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-16-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 78dc9f84 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: use best found when complex scan of group finishs

If any bex which meets bex->fe_len >= gex->fe_len is found, then it will
always be used when the complex scan of the group that bex belongs to
finishes. So there will not be any lock-unlock period.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-15-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 32c08693 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits

Only call trace of ext4_mb_check_limits is as following:
ext4_mb_complex_scan_group
ext4_mb_measure_extent
ext4_mb_check_limits(ac, e4b, 0);
ext4_mb_check_limits(ac, e4b, 1);

If the first ac->ac_found > sbi->s_mb_max_to_scan check in
ext4_mb_check_limits is met, we will set ac_status to
AC_STATUS_BREAK and call ext4_mb_try_best_found to try to use
ac->ac_b_ex.
If ext4_mb_try_best_found succeeds, then block allocation finishes and
the removed ac->ac_found > sbi->s_mb_min_to_scan check is not reachable.
If ext4_mb_try_best_found fails, then we set EXT4_MB_HINT_FIRST and
reset ac->ac_b_ex to retry block allocation. We will use any found
free extent in ext4_mb_measure_extent before reaching the removed
ac->ac_found > sbi->s_mb_min_to_scan check.
In summary, the removed ac->ac_found > sbi->s_mb_min_to_scan check is
not reachable and we can remove that dead check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-14-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 976620bd 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove dead check in mb_buddy_mark_free

We always adjust first to an even number and adjust last to an odd number,
so first == last will never happen. Remove this dead check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-13-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# aaae558d 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unnecessary check in ext4_mb_new_blocks

1. Remove the unnecessary ac check:
We always go to the out tag before ac is successfully allocated, so we can
move the out tag after the freeing of ac and remove the NULL check of ac.

2. Remove the unnecessary *errp check:
We always go to the errout tag if *errp is non-zero, so we can move the
errout tag into the error handling done when *errp is non-zero.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-12-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 285164b8 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unnecessary e4b->bd_buddy_page check in ext4_mb_load_buddy_gfp

e4b->bd_buddy_page is only set if we initialize ext4_buddy successfully, so
e4b->bd_buddy_page is always NULL in the error handling branch. Just remove
the dead check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-11-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 139f46d3 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: Remove unnecessary release when memory allocation failed in ext4_mb_init_cache

If allocating the array of buffer_heads fails, there is no resource that
needs to be freed and we can simply return an error.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 85b67ffb 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unused return value of ext4_mb_try_best_found and ext4_mb_free_metadata

The return values of the static functions ext4_mb_try_best_found and
ext4_mb_free_metadata are not used. Just remove the unused return values.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1b5c9d34 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: add missed brelse in ext4_free_blocks_simple

Release the bitmap buffer_head we got if an error occurs.
Besides, this patch removes an unused assignment to err.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 36cb0f52 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: protect pa->pa_free in ext4_discard_allocated_blocks

If ext4_mb_mark_diskspace_used fails in ext4_mb_new_blocks, we may
discard a pa that is already on the list. Protect pa with pa_lock to avoid
the race.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1afdc588 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: correct start of used group pa for debug in ext4_mb_use_group_pa

As we don't correct pa_lstart here, there is no need to subtract
the consumed length from pa_lstart.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# abc075d4 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: correct calculation of s_mb_preallocated

We will add pa_free to s_mb_preallocated when a new ext4_prealloc_space is
created. In ext4_mb_new_inode_pa, we will call ext4_mb_use_inode_pa
before adding pa_free to s_mb_preallocated. However, ext4_mb_use_inode_pa
will consume pa_free for the block allocation which triggered the creation
of the ext4_prealloc_space. Add pa_free to s_mb_preallocated before
ext4_mb_use_inode_pa to correct the calculation of s_mb_preallocated.
There is no such problem in ext4_mb_new_group_pa, as pa_free of a group pa
is consumed in ext4_mb_release_context instead of ext4_mb_use_group_pa.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 22fab984 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: get correct ext4_group_info in ext4_mb_prefetch_fini

We always get the ext4_group_desc with group + 1 and the ext4_group_info with
group to check whether we need to initialize the ext4_group_info for the
group. Just get the ext4_group_desc with group for the ext4_group_info
initialization check.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 01e4ca29 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: allow to find by goal if EXT4_MB_HINT_GOAL_ONLY is set

If EXT4_MB_HINT_GOAL_ONLY is set, ext4_mb_regular_allocator will only
allocate blocks from ext4_mb_find_by_goal. Allow finding by goal in
ext4_mb_find_by_goal if EXT4_MB_HINT_GOAL_ONLY is set; otherwise allocation
with EXT4_MB_HINT_GOAL_ONLY set will always fail.

EXT4_MB_HINT_GOAL_ONLY is not used at all, so the problem has not been
hit so far.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b07ffe69 03-Mar-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: set goal start correctly in ext4_mb_normalize_request

We need to set ac_g_ex to notify the goal start used in
ext4_mb_find_by_goal. Set ac_g_ex instead of ac_f_ex in
ext4_mb_normalize_request.
Besides, we should ensure the goal start is in the range [first_data_block,
blocks_count) as ext4_mb_initialize_context does.

[ Added a check to make sure size is less than ar->pright; otherwise
we could end up passing an underflowed value of ar->pright - size to
ext4_get_group_no_and_offset(), which will trigger a BUG_ON later on.
- TYT ]
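
The range check amounts to a simple clamp; a standalone sketch (illustrative
only, not the kernel code):

    /* keep the computed goal inside [first_data_block, blocks_count) */
    unsigned long long clamp_goal_start(unsigned long long goal,
                                        unsigned long long first_data_block,
                                        unsigned long long blocks_count)
    {
            if (goal < first_data_block || goal >= blocks_count)
                    return first_data_block;
            return goal;
    }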

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1df9bde4 21-Feb-2023 Kemeng Shi <shikemeng@huaweicloud.com>

ext4: remove unused group parameter in ext4_block_bitmap_csum_set

Remove the unused group parameter in ext4_block_bitmap_csum_set. After this,
the group parameter in ext4_set_bitmap_checksums is also unused, so just
remove it too.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230221203027.2359920-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d73eff68 02-Dec-2022 Guoqing Jiang <guoqing.jiang@linux.dev>

ext4: make ext4_mb_initialize_context return void

Change the return type to void since it always returns 0, and there is no
need to do the check in ext4_mb_new_blocks.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20221202120409.24098-1-guoqing.jiang@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a078dff8 22-Sep-2022 Jan Kara <jack@suse.cz>

ext4: fixup possible uninitialized variable access in ext4_mb_choose_next_group_cr1()

Variable 'grp' may be left uninitialized if there's no group with
suitable average fragment size (or larger). Fix the problem by
initializing it earlier.

Link: https://lore.kernel.org/r/20220922091542.pkhedytey7wzp5fi@quack3
Fixes: 83e80a6e3543 ("ext4: use buckets for cr 1 block scan instead of rbtree")
Cc: stable@kernel.org
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 80fa46d6 01-Sep-2022 Theodore Ts'o <tytso@mit.edu>

ext4: limit the number of retries after discarding preallocations blocks

This patch avoids threads live-locking for hours when a large number of
threads are competing over the last few free extents as blocks get
added to and removed from preallocation pools. From our bug
reporter:

A reliable way for triggering this has multiple writers
continuously write() to files when the filesystem is full, while
small amounts of space are freed (e.g. by truncating a large file
-1MiB at a time). In the local filesystem, this can be done by
simply not checking the return code of write (0) and/or the error
(ENOSPACE) that is set. Over NFS with an async mount, even clients
with proper error checking will behave this way since the linux NFS
client implementation will not propagate the server errors [the
write syscalls immediately return success] until the file handle is
closed. This leads to a situation where NFS clients send a
continuous stream of WRITE rpcs which result in ERRNOSPACE -- but
since the client isn't seeing this, the stream of writes continues
at maximum network speed.

When some space does appear, multiple writers will all attempt to
claim it for their current write. For NFS, we may see dozens to
hundreds of threads that do this.

The real-world scenario of this is database backup tooling (in
particular, github.com/mdkent/percona-xtrabackup) which may write
large files (>1TiB) to NFS for safe keeping. Some temporary files
are written, rewound, and read back -- all before closing the file
handle (the temp file is actually unlinked, to trigger automatic
deletion on close/crash.) An application like this operating on an
async NFS mount will not see an error code until TiB have been
written/read.

The lockup was observed when running this database backup on large
filesystems (64 TiB in this case) with a high number of block
groups and no free space. Fragmentation is generally not a factor
in this filesystem (~thousands of large files, mostly contiguous
except for the parts written while the filesystem is at capacity.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 83e80a6e 08-Sep-2022 Jan Kara <jack@suse.cz>

ext4: use buckets for cr 1 block scan instead of rbtree

Using an rbtree for sorting groups by average fragment size is relatively
expensive (it needs an rbtree update on every block freeing or allocation)
and leads to wide spreading of allocations because the selection of the block
group is very sensitive both to changes in free space and to the amount of
blocks allocated. Furthermore, selecting the group with the best matching
average fragment size is not necessary anyway, even more so because the
variability of fragment sizes within a group is likely large, so the average
is not telling much. We just need a group with a large enough average
fragment size so that we have a high probability of finding a large enough
free extent, and we don't want the average fragment size to be too big so
that we are likely to find a free extent only somewhat larger than what we
need.

So instead of maintaining an rbtree of groups sorted by fragment size, keep
bins (lists) of groups where the average fragment size is in the interval
[2^i, 2^(i+1)). This structure requires fewer updates on block allocation /
freeing, generally avoids chaotic spreading of allocations into block
groups, and is still able to quickly (even faster than the rbtree)
provide a block group which is likely to have a suitably sized free
space extent.
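
For illustration, the bucket index is just the floor of log2 of the average
fragment size; a standalone sketch (not the kernel code):

    /* map an average fragment size (in blocks) to bucket i,
     * such that the size falls in [2^i, 2^(i+1)) */
    unsigned int frag_size_bucket(unsigned long avg_fragment_size)
    {
            unsigned int i = 0;

            while (avg_fragment_size >>= 1)
                    i++;
            return i;   /* sizes 0 and 1 both land in bucket 0 */
    }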

This patch reduces the number of block groups used when untarring an archive
with medium sized files (size somewhat above 64k, which is the default
mballoc limit for avoiding locality group preallocation) to about half,
and thus improves write speeds for eMMC flash significantly.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a9f2a293 08-Sep-2022 Jan Kara <jack@suse.cz>

ext4: use locality group preallocation for small closed files

Currently we don't use any preallocation when a file is already closed
when allocating blocks (from the writeback code when converting delayed
allocation). However for small files, using locality group preallocation
is actually desirable as that is not specific to a particular file.
Rather it is a method to pack small files together to reduce
fragmentation, and for that the fact that the file is closed is actually an
even stronger hint that the file would benefit from packing. So change the
logic to allow locality group preallocation in this case.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1940265e 08-Sep-2022 Jan Kara <jack@suse.cz>

ext4: avoid unnecessary spreading of allocations among groups

mb_set_largest_free_order() updates lists containing groups with largest
chunk of free space of given order. The way it updates it leads to
always moving the group to the tail of the list. Thus allocations
looking for free space of given order effectively end up cycling through
all groups (and due to initialization in last to first order). This
spreads allocations among block groups which reduces performance for
rotating disks or low-end flash media. Change
mb_set_largest_free_order() to only update lists if the order of the
largest free chunk in the group changed.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4fca50d4 08-Sep-2022 Jan Kara <jack@suse.cz>

ext4: make mballoc try target group first even with mb_optimize_scan

One of the side-effects of mb_optimize_scan was that the optimized
functions to select the next group to try were called even before we tried
the goal group. As a result we no longer allocate files close to their
corresponding inodes, and we don't try to expand the currently
allocated extent in the same group. This results in a reaim regression
with the workfile.disk workload of up to 8% with many clients on my test
machine:

baseline mb_optimize_scan
Hmean disk-1 2114.16 ( 0.00%) 2099.37 ( -0.70%)
Hmean disk-41 87794.43 ( 0.00%) 83787.47 * -4.56%*
Hmean disk-81 148170.73 ( 0.00%) 135527.05 * -8.53%*
Hmean disk-121 177506.11 ( 0.00%) 166284.93 * -6.32%*
Hmean disk-161 220951.51 ( 0.00%) 207563.39 * -6.06%*
Hmean disk-201 208722.74 ( 0.00%) 203235.59 ( -2.63%)
Hmean disk-241 222051.60 ( 0.00%) 217705.51 ( -1.96%)
Hmean disk-281 252244.17 ( 0.00%) 241132.72 * -4.41%*
Hmean disk-321 255844.84 ( 0.00%) 245412.84 * -4.08%*

Also this was causing a huge regression (time increased by a factor of 5 or
so) when untarring an archive with lots of small files on some eMMC storage
cards.

Fix the problem by making sure we try goal group first.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1e1c2b86 14-Jul-2022 Lukas Czerner <lczerner@redhat.com>

ext4: block range must be validated before use in ext4_mb_clear_bb()

The block range to free is validated in ext4_free_blocks() using
ext4_inode_block_valid() and then it's passed to ext4_mb_clear_bb().
However, in some situations on a bigalloc file system the range might be
adjusted after the validation in ext4_free_blocks(), which can lead to
trouble on corrupted file systems such as the one found by syzkaller that
resulted in the following BUG:

kernel BUG at fs/ext4/ext4.h:3319!
PREEMPT SMP NOPTI
CPU: 28 PID: 4243 Comm: repro Kdump: loaded Not tainted 5.19.0-rc6+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
RIP: 0010:ext4_free_blocks+0x95e/0xa90
Call Trace:
<TASK>
? lock_timer_base+0x61/0x80
? __es_remove_extent+0x5a/0x760
? __mod_timer+0x256/0x380
? ext4_ind_truncate_ensure_credits+0x90/0x220
ext4_clear_blocks+0x107/0x1b0
ext4_free_data+0x15b/0x170
ext4_ind_truncate+0x214/0x2c0
? _raw_spin_unlock+0x15/0x30
? ext4_discard_preallocations+0x15a/0x410
? ext4_journal_check_start+0xe/0x90
? __ext4_journal_start_sb+0x2f/0x110
ext4_truncate+0x1b5/0x460
? __ext4_journal_start_sb+0x2f/0x110
ext4_evict_inode+0x2b4/0x6f0
evict+0xd0/0x1d0
ext4_enable_quotas+0x11f/0x1f0
ext4_orphan_cleanup+0x3de/0x430
? proc_create_seq_private+0x43/0x50
ext4_fill_super+0x295f/0x3ae0
? snprintf+0x39/0x40
? sget_fc+0x19c/0x330
? ext4_reconfigure+0x850/0x850
get_tree_bdev+0x16d/0x260
vfs_get_tree+0x25/0xb0
path_mount+0x431/0xa70
__x64_sys_mount+0xe2/0x120
do_syscall_64+0x5b/0x80
? do_user_addr_fault+0x1e2/0x670
? exc_page_fault+0x70/0x170
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fdf4e512ace

Fix it by making sure that the block range is properly validated before
being used, every time it changes in ext4_free_blocks() or ext4_mb_clear_bb().

Link: https://syzkaller.appspot.com/bug?id=5266d464285a03cee9dbfda7d2452a72c3c2ae7c
Reported-by: syzbot+15cd994e273307bf5cfa@syzkaller.appspotmail.com
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Link: https://lore.kernel.org/r/20220714165903.58260-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 218a6944 06-Jun-2022 hanjinke <hanjinke.666@bytedance.com>

ext4: reuse order and buddy in mb_mark_used when buddy split

After each buddy split, mb_mark_used will search for the proper order
of the block, which may spend some loops in mb_find_order_for_block.
In fact, we can reuse the order and buddy generated by the buddy split.

Reviewed by: lei.rao@intel.com
Signed-off-by: hanjinke <hanjinke.666@bytedance.com>
Link: https://lore.kernel.org/r/20220606155305.74146-1-hanjinke.666@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# cf4ff938 28-May-2022 Baokun Li <libaokun1@huawei.com>

ext4: correct the judgment of BUG in ext4_mb_normalize_request

ext4_mb_normalize_request() can move the logical start of the allocated
blocks to reduce fragmentation and better utilize preallocation. However, the
logical block requested as the start of the allocation (ac->ac_o_ex.fe_logical)
should always be covered by the allocated blocks, so we should check that by
changing the 'and' to an 'or' in the assertion.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220528110017.354175-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a08f789d 28-May-2022 Baokun Li <libaokun1@huawei.com>

ext4: fix bug_on ext4_mb_use_inode_pa

Hulk Robot reported a BUG_ON:
==================================================================
kernel BUG at fs/ext4/mballoc.c:3211!
[...]
RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
[...]
Call Trace:
ext4_mb_new_blocks+0x9df/0x5d30
ext4_ext_map_blocks+0x1803/0x4d80
ext4_map_blocks+0x3a4/0x1a10
ext4_writepages+0x126d/0x2c30
do_writepages+0x7f/0x1b0
__filemap_fdatawrite_range+0x285/0x3b0
file_write_and_wait_range+0xb1/0x140
ext4_sync_file+0x1aa/0xca0
vfs_fsync_range+0xfb/0x260
do_fsync+0x48/0xa0
[...]
==================================================================

Above issue may happen as follows:
-------------------------------------
do_fsync
vfs_fsync_range
ext4_sync_file
file_write_and_wait_range
__filemap_fdatawrite_range
do_writepages
ext4_writepages
mpage_map_and_submit_extent
mpage_map_one_extent
ext4_map_blocks
ext4_mb_new_blocks
ext4_mb_normalize_request
>>> start + size <= ac->ac_o_ex.fe_logical
ext4_mb_regular_allocator
ext4_mb_simple_scan_group
ext4_mb_use_best_found
ext4_mb_new_preallocation
ext4_mb_new_inode_pa
ext4_mb_use_inode_pa
>>> set ac->ac_b_ex.fe_len <= 0
ext4_mb_mark_diskspace_used
>>> BUG_ON(ac->ac_b_ex.fe_len <= 0);

we can easily reproduce this problem with the following commands:
`fallocate -l100M disk`
`mkfs.ext4 -b 1024 -g 256 disk`
`mount disk /mnt`
`fsstress -d /mnt -l 0 -n 1000 -p 1`

The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
when the size is truncated. So start should be the start position of
the group where ac_o_ex.fe_logical is located after alignment.
In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
is very large, the value calculated by start_off is more accurate.
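
In other words, the normalized start is aligned down to the containing block
group; a standalone sketch of that alignment (illustrative only, not the
kernel code):

    /* align a logical block down to the start of its block group so the
     * normalized window [start, start + size) still covers fe_logical */
    unsigned long long group_aligned_start(unsigned long long fe_logical,
                                           unsigned int blocks_per_group)
    {
            return fe_logical - (fe_logical % blocks_per_group);
    }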

Cc: stable@kernel.org
Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220528110017.354175-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d63c00ea 17-Apr-2022 Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>

ext4: mark group as trimmed only if it was fully scanned

Otherwise non-aligned fstrim calls will work inconveniently for iterative
scanners, for example:

// trim [0,16MB] for group-1, but mark full group as trimmed
fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m
// handle [16MB,16MB] for group-1, do nothing because group already has the flag.
fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m

[ Update function documentation for ext4_trim_all_free -- TYT ]

Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Link: https://lore.kernel.org/r/1650214995-860245-1-git-send-email-dmtrmonakhov@yandex-team.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# af2b3275 04-Apr-2022 Jinke Han <hanjinke.666@bytedance.com>

ext4: remove unnecessary code in __mb_check_buddy

When entering the else-if branch, the MB_CHECK_ASSERT will never fail.
In addition, the only illegal combination is 0/0, which can be caught
by the first if branch.

Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220404152243.13556-1-hanjinke.666@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c30365b9 01-Apr-2022 Yu Zhe <yuzhe@nfschina.com>

ext4: remove unnecessary type castings

Remove unnecessary void* type casts.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Link: https://lore.kernel.org/r/20220401081321.73735-1-yuzhe@nfschina.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 44abff2c 14-Apr-2022 Christoph Hellwig <hch@lst.de>

block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD

Secure erase is a very different operation from discard in that it is
a data integrity operation vs hint. Fully split the limits and helper
infrastructure to make the separation more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2]
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7b47ef52 14-Apr-2022 Christoph Hellwig <hch@lst.de>

block: add a bdev_discard_granularity helper

Abstract away implementation details from file systems by providing a
block_device based helper to retrieve the discard granularity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Link: https://lore.kernel.org/r/20220415045258.199825-26-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 10f0d2a5 14-Apr-2022 Christoph Hellwig <hch@lst.de>

block: add a bdev_nonrot helper

Add a helper to check the nonrot flag based on the block_device instead
of having to poke into the block layer internal request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 077d0c2c 08-Mar-2022 Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: make mb_optimize_scan performance mount option work with extents

Currently the mb_optimize_scan feature, which improves filesystem
performance heavily (when the FS is fragmented), seems to not be working
with files with extents (ext4 by default creates files with extents).

This patch fixes that and makes the mb_optimize_scan feature work
for files with extents.

Below are some performance numbers obtained when allocating a 10M and a 100M
file with and without this patch on a filesystem with no 1M contiguous block.

<perf numbers>
===============
Workload: dd if=/dev/urandom of=test conv=fsync bs=1M count=10/100

Time taken
=====================================================
no. Size without-patch with-patch Diff(%)
1 10M 0m8.401s 0m5.623s 33.06%
2 100M 1m40.465s 1m14.737s 25.6%

<debug stats>
=============
w/o patch:
mballoc:
reqs: 17056
success: 11407
groups_scanned: 13643
cr0_stats:
hits: 37
groups_considered: 9472
useless_loops: 36
bad_suggestions: 0
cr1_stats:
hits: 11418
groups_considered: 908560
useless_loops: 1894
bad_suggestions: 0
cr2_stats:
hits: 1873
groups_considered: 6913
useless_loops: 21
cr3_stats:
hits: 21
groups_considered: 5040
useless_loops: 21
extents_scanned: 417364
goal_hits: 3707
2^n_hits: 37
breaks: 1873
lost: 0
buddies_generated: 239/240
buddies_time_used: 651080
preallocated: 705
discarded: 478

with patch:
mballoc:
    reqs: 12768
    success: 11305
    groups_scanned: 12768
    cr0_stats:
        hits: 1
        groups_considered: 18
        useless_loops: 0
        bad_suggestions: 0
    cr1_stats:
        hits: 5829
        groups_considered: 50626
        useless_loops: 0
        bad_suggestions: 0
    cr2_stats:
        hits: 6938
        groups_considered: 580363
        useless_loops: 0
    cr3_stats:
        hits: 0
        groups_considered: 0
        useless_loops: 0
    extents_scanned: 309059
    goal_hits: 0
    2^n_hits: 1
    breaks: 1463
    lost: 0
    buddies_generated: 239/240
    buddies_time_used: 791392
    preallocated: 673
    discarded: 446

Fixes: 196e402 (ext4: improve cr 0 / cr 1 group scanning)
Cc: stable@kernel.org
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Suggested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/fc9a48f7f8dcfc83891a8b21f6dd8cdf056ed810.1646732698.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# fd9b6fad 01-Mar-2022 Yang Li <yang.lee@linux.alibaba.com>

ext4: fix ext4_mb_clear_bb() kernel-doc comment

Remove the excess description of @bh in ext4_mb_clear_bb() kernel-doc
comment to remove warnings found by running scripts/kernel-doc, which
is caused by using 'make W=1'.

fs/ext4/mballoc.c:5895: warning: Excess function parameter 'bh'
description in 'ext4_mb_clear_bb'

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220301092136.34764-1-yang.lee@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8c91c579 15-Feb-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: add extra check in ext4_mb_mark_bb() to prevent against possible corruption

This patch adds an extra check in ext4_mb_mark_bb() to make sure we
report an error and bail out if we were asked to mark/clear any bits
belonging to critical FS metadata bitmaps, to prevent accidental
corruption.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/53cbb6f2573db162a57f935365050d8b1df202ee.1644992610.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a00b482b 15-Feb-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: add strict range checks while freeing blocks

Currently ext4_mb_clear_bb() & ext4_group_add_blocks() only check
whether the given block range (which is to be freed) overlaps the FS
metadata blocks of the range's own block group.
But to detect FS errors early, it is better to add stricter checks in
those functions that verify whether the given blocks belong to any
critical FS metadata within the system zone.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/ddd9143d064774e32d6364a99667817c6e8bfdc0.1644992610.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bd8247ee 15-Feb-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: no need to test for block bitmap bits in ext4_mb_mark_bb()

We don't need the return value of mb_test_and_clear_bits() in ext4_mb_mark_bb(),
so simply use mb_clear_bits() instead.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a971935306dafb124da0193c7dad1aa485210b62.1644992610.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 123e3016 15-Feb-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: rename ext4_set_bits to mb_set_bits

ext4_set_bits() should actually be mb_set_bits() for a uniform API naming
convention.
This was done via the below command:

grep -nr "ext4_set_bits" fs/ext4/ | cut -d ":" -f 1 | xargs sed -i 's/ext4_set_bits/mb_set_bits/g'

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/f1f6ece1405b76a7a987e9145d1adfaf71e30695.1644992610.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8ac3939d 15-Feb-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: refactor ext4_free_blocks() to pull out ext4_mb_clear_bb()

The ext4_free_blocks() function became too long and confusing. This patch
just pulls the logic that clears the block bitmap and frees the blocks
out of it and into a new ext4_mb_clear_bb() function.

No functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/22c30fbb26ba409cf8aa5f0c7912970272c459e8.1644992610.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bfdc502a 15-Feb-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: fix ext4_mb_mark_bb() with flex_bg with fast_commit

In the case of the flex_bg feature (which is enabled by default), the
extents of any given inode might span blocks from two different block
groups. ext4_mb_mark_bb() only reads the buffer_head of the block bitmap
once, for the starting block group, but it fails to read it again when
the extent crosses into another block group. Then, in the loop below, it
accesses memory beyond the block group bitmap buffer_head and results in
a data abort.

    for (i = 0; i < clen; i++)
            if (!mb_test_bit(blkoff + i, bitmap_bh->b_data) == !state)
                    already++;

This patch adds a block group boundary check to ext4_mb_mark_bb() and
updates the buffer_head (bitmap_bh) for every block group the range
touches.
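
To illustrate the boundary handling, here is a small stand-alone model
(not kernel code; the group size is made up) of how a range that crosses
a block-group boundary must be split so that each group's bitmap is
loaded and updated separately:

    #include <stdio.h>

    #define BLOCKS_PER_GROUP 32768ULL   /* hypothetical group size */

    int main(void)
    {
            unsigned long long block = 11ULL * BLOCKS_PER_GROUP - 5; /* near end of group 10 */
            unsigned long long len = 20;                             /* spills into group 11 */

            while (len > 0) {
                    unsigned long long group = block / BLOCKS_PER_GROUP;
                    unsigned long long off = block % BLOCKS_PER_GROUP;
                    unsigned long long thislen = BLOCKS_PER_GROUP - off;

                    if (thislen > len)
                            thislen = len;
                    /* the real code reloads this group's bitmap_bh here and
                     * only touches thislen bits starting at off */
                    printf("group %llu: %llu bits from offset %llu\n",
                           group, thislen, off);
                    block += thislen;
                    len -= thislen;
            }
            return 0;
    }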

Without this patch, I was easily able to hit a data access abort on the Power platform.

<...>
[ 74.327662] EXT4-fs error (device loop3): ext4_mb_generate_buddy:1141: group 11, block bitmap and bg descriptor inconsistent: 21248 vs 23294 free clusters
[ 74.533214] EXT4-fs (loop3): shut down requested (2)
[ 74.536705] Aborting journal on device loop3-8.
[ 74.702705] BUG: Unable to handle kernel data access on read at 0xc00000005e980000
[ 74.703727] Faulting instruction address: 0xc0000000007bffb8
cpu 0xd: Vector: 300 (Data Access) at [c000000015db7060]
pc: c0000000007bffb8: ext4_mb_mark_bb+0x198/0x5a0
lr: c0000000007bfeec: ext4_mb_mark_bb+0xcc/0x5a0
sp: c000000015db7300
msr: 800000000280b033
dar: c00000005e980000
dsisr: 40000000
current = 0xc000000027af6880
paca = 0xc00000003ffd5200 irqmask: 0x03 irq_happened: 0x01
pid = 5167, comm = mount
<...>
enter ? for help
[c000000015db7380] c000000000782708 ext4_ext_clear_bb+0x378/0x410
[c000000015db7400] c000000000813f14 ext4_fc_replay+0x1794/0x2000
[c000000015db7580] c000000000833f7c do_one_pass+0xe9c/0x12a0
[c000000015db7710] c000000000834504 jbd2_journal_recover+0x184/0x2d0
[c000000015db77c0] c000000000841398 jbd2_journal_load+0x188/0x4a0
[c000000015db7880] c000000000804de8 ext4_fill_super+0x2638/0x3e10
[c000000015db7a40] c0000000005f8404 get_tree_bdev+0x2b4/0x350
[c000000015db7ae0] c0000000007ef058 ext4_get_tree+0x28/0x40
[c000000015db7b00] c0000000005f6344 vfs_get_tree+0x44/0x100
[c000000015db7b70] c00000000063c408 path_mount+0xdd8/0xe70
[c000000015db7c40] c00000000063c8f0 sys_mount+0x450/0x550
[c000000015db7d50] c000000000035770 system_call_exception+0x4a0/0x4e0
[c000000015db7e10] c00000000000c74c system_call_common+0xec/0x250

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/2609bc8f66fc15870616ee416a18a3d392a209c4.1644992609.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a5c0e2fd 15-Feb-2022 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: correct cluster len and clusters changed accounting in ext4_mb_mark_bb

ext4_mb_mark_bb() currently wrongly calculates cluster len (clen) and
flex_group->free_clusters. This patch fixes that.

Identified based on code review of ext4_mb_mark_bb() function.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a0b035d536bafa88110b74456853774b64c8ac40.1644992609.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 31a074a0 09-Jan-2022 Xin Yin <yinxin.x@bytedance.com>

ext4: modify the logic of ext4_mb_new_blocks_simple

Currently in ext4_mb_new_blocks_simple, if we find a block which should
be excluded, we switch to the next group; this may cause 'group' to run
out of range.

Change it to check the next block in the same group when the found block
should be excluded. Also change the search range to
EXT4_CLUSTERS_PER_GROUP and add error checking.

Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20220110035141.1980-3-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 359745d7 21-Jan-2022 Muchun Song <songmuchun@bytedance.com>

proc: remove PDE_DATA() completely

Remove PDE_DATA() completely and replace it with pde_data().

[akpm@linux-foundation.org: fix naming clash in drivers/nubus/proc.c]
[akpm@linux-foundation.org: now fix it properly]

Link: https://lkml.kernel.org/r/20211124081956.87711-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2327fb2e 03-Nov-2021 Lukas Czerner <lczerner@redhat.com>

ext4: change s_last_trim_minblks type to unsigned long

There is no good reason for the s_last_trim_minblks to be atomic. There is
no data integrity needed and there is no real danger in setting and
reading it in a racy manner. Change it to be unsigned long, the same type
as s_clusters_per_group which is the maximum that's allowed.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20211103145122.17338-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 173b6e38 12-Nov-2021 Jan Kara <jack@suse.cz>

ext4: avoid trim error on fs with small groups

A user reported the FITRIM ioctl failing for him on ext4 on some devices
without apparent reason. After some debugging we found out that these
devices (being LVM volumes) report a rather large discard granularity of
42MB, while the filesystem had a 1k blocksize and thus a group size of
8MB. Because the ext4 FITRIM implementation puts the discard granularity
into minlen, ext4_trim_fs() declared the trim request invalid. However,
just silently doing nothing seems to be a more appropriate reaction to
such a combination of parameters, since the user did not specify
anything wrong.
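
A sketch of the behavioural change (simplified; minlen here is the value
after it has been raised to the discard granularity and converted to
clusters):

    if (minlen > EXT4_CLUSTERS_PER_GROUP(sb))
            return 0;   /* nothing that large fits in a group; not an error */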

CC: Lukas Czerner <lczerner@redhat.com>
Fixes: 5c2ed62fd447 ("ext4: Adjust minlen with discard_granularity in the FITRIM ioctl")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211112152202.26614-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8c80fb31 22-Nov-2021 Chunguang Xu <brookxu@tencent.com>

ext4: fix a possible ABBA deadlock due to busy PA

We found on an older kernel (3.10) that, in the scenario of insufficient
disk space, the system may trigger an ABBA deadlock. It seems this
problem still exists in the latest kernel, so try to fix it here. The
sequence that triggers the problem is: task A holds a PA and waits for
the jbd2 transaction to finish, the jbd2 transaction waits for the
completion of task B's IO (plug_list), but task B waits for task A to
release the PA in order to finish its discard, which indirectly forms an
ABBA deadlock. The related calltrace is as follows:

Task A:
  vfs_write
    ext4_mb_new_blocks()
      ext4_mb_mark_diskspace_used()
        jbd2_journal_get_write_access()   (waits for the jbd2 transaction)
          ->schedule()

JBD2:
  jbd2_journal_commit_transaction()
    filemap_fdatawait()                   (waits for task B's IO)

Task B:
  do_unlinkat()
    ext4_evict_inode()
      jbd2_journal_begin_ordered_truncate()
        filemap_fdatawrite_range()
          ext4_mb_new_blocks()
            ext4_mb_discard_group_preallocations()   (waits for task A's PA)

Here, cancel the internal retry on a busy PA inside
ext4_mb_discard_group_preallocations(), and instead do a limited number
of retries inside ext4_mb_discard_preallocations(). This circumvents the
problem above and also has some advantages:

1. Since the PA is in a busy state, if other groups have free PAs,
   keeping the current PA may help to reduce fragmentation.
2. We continue to traverse forward instead of waiting for the current
   group's PA to be released, so in most scenarios the PA discard time
   can be reduced.

However, when free space is small and only a few groups have space, the
repeated traversals of the groups may increase CPU overhead. But on
balance, I feel the overall benefit is better than the cost.
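
Roughly, the retry moves up into ext4_mb_discard_preallocations() and
becomes bounded, along these lines (a sketch; signatures are simplified
and the retry budget shown is illustrative):

    repeat:
            for (i = 0; i < ngroups && needed > 0; i++) {
                    ret = ext4_mb_discard_group_preallocations(sb, i, &busy);
                    freed += ret;
                    needed -= ret;
            }
            /* PAs were busy but nothing was freed: retry a few times at most */
            if (needed > 0 && busy && ++retry < 3) {
                    busy = 0;
                    goto repeat;
            }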

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1637630277-23496-1-git-send-email-brookxu.cn@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# afcc4e32 20-Aug-2021 Lukas Bulwahn <lukas.bulwahn@gmail.com>

ext4: scope ret locally in ext4_try_to_trim_range()

As commit 6920b3913235 ("ext4: add new helper interface
ext4_try_to_trim_range()") moves some code into the separate function
ext4_try_to_trim_range(), the use of the variable ret within that
function is more limited and can be adjusted as well.

Scope the use of the variable ret locally and drop dead assignments.

No functional change.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20210820120853.23134-1-lukas.bulwahn@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 188c299e 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Support for checksumming from journal triggers

The JBD2 layer supports triggers which are called when the journaling
layer moves a buffer to a certain state. We can use the frozen trigger,
which gets called when buffer data is frozen and about to be written out
to the journal, to compute block checksums for some buffer types
(similarly to what ocfs2 does). This avoids unnecessary repeated
recomputation of the checksum (at the cost of a larger window where
memory corruption won't be caught by checksumming) and is even necessary
when there are unsynchronized updaters of the checksummed data.

So add superblock and journal trigger type arguments to
ext4_journal_get_write_access() and ext4_journal_get_create_access() so
that frozen triggers can be set accordingly. Also add inode argument to
ext4_walk_page_buffers() and all the callbacks used with that function
for the same purpose. This patch is mostly only a change of prototype of
the above mentioned functions and a few small helpers. Real checksumming
will come later.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a5fda113 14-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: fix sparse warnings

Add sparse annotations to suppress false positive context imbalance
warnings, and use NULL instead of 0 in EXT_MAX_{EXTENT,INDEX}.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5036ab8d 30-Aug-2021 Wang Jianchao <wangjianchao@kuaishou.com>

ext4: flush background discard kwork when retry allocation

The background discard kwork tries to mark blocks as used and issue a
discard. This can make the filesystem suffer from NOSPC errors; xfstest
generic/371 can fail because of it. Fix it by flushing the discard kwork
in ext4_should_retry_alloc, and give up discarding for the moment.

Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Link: https://lore.kernel.org/r/20210830075246.12516-6-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 55cdd0af 24-Jul-2021 Wang Jianchao <wangjianchao@kuaishou.com>

ext4: get discard out of jbd2 commit kthread contex

Right now, discard is issued and waited for in the jbd2 commit kthread
context after the logs are committed. When a large number of files are
deleted and discard is flooding, the jbd2 commit kthread can be blocked
for a long time. Then all metadata operations can be blocked waiting for
log space.

One case is the page fault path with the read mm->mmap_sem held, which
wants to update the file time but has to wait for log space. When other
threads in the task want to mmap, the write mmap_sem acquisition is
blocked. Finally all subsequent read mmap_sem acquisitions are blocked,
even the ps command, which needs to read /proc/pid/cmdline. Our monitor
service, which needs to read /proc/pid/cmdline, used to be blocked for 5
minutes.

This patch frees the blocks back to the buddy after commit and then does
the discard in an async kworker context in fstrim fashion, namely:
- mark blocks to be discarded as used if they have not been allocated
- do the discard
- mark them free again
After this, the jbd2 commit kthread won't be blocked by discard any more
and we won't get NOSPC even if the discard is slow or throttled.
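
Condensed, the kworker does something of this shape for every extent
queued for discard (a sketch only; error handling, the free-extent setup
and the wait for discard completion are omitted):

    ext4_lock_group(sb, group);
    mb_mark_used(e4b, &ex);                     /* 1. mark the range as used  */
    ext4_unlock_group(sb, group);

    ext4_issue_discard(sb, group, start, count, NULL);  /* 2. do the discard  */

    ext4_lock_group(sb, group);
    mb_free_blocks(NULL, e4b, start, count);    /* 3. mark it free again      */
    ext4_unlock_group(sb, group);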

Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Link: https://lore.kernel.org/r/20210830075246.12516-5-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b6f5558c 24-Jul-2021 Wang Jianchao <wangjianchao@kuaishou.com>

ext4: remove the repeated comment of ext4_trim_all_free

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210724074124.25731-4-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6920b391 24-Jul-2021 Wang Jianchao <wangjianchao@kuaishou.com>

ext4: add new helper interface ext4_try_to_trim_range()

There is no functional change in this patch; it just splits the code
which searches for free blocks and does the trim into a new function,
ext4_try_to_trim_range. This is in preparation for the following async
background discard.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210724074124.25731-3-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bd2eea8d 24-Jul-2021 Wang Jianchao <wangjianchao@kuaishou.com>

ext4: remove the 'group' parameter of ext4_trim_extent

Get rid of the 'group' parameter of ext4_trim_extent as we can get
it from the 'e4b'.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210724074124.25731-2-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# cd84bbba 23-Jun-2021 Stephen Brennan <stephen.s.brennan@oracle.com>

ext4: use ext4_grp_locked_error in mb_find_extent

Commit 5d1b1b3f492f ("ext4: fix BUG when calling ext4_error with locked
block group") introduces ext4_grp_locked_error to handle unlocking a
group in error cases. Otherwise, there is a possibility of a sleep while
atomic. However, since 43c73221b3b1 ("ext4: replace BUG_ON with WARN_ON
in mb_find_extent()"), mb_find_extent() has contained an ext4_error()
call while a group spinlock is held. Replace this with
ext4_grp_locked_error.

Fixes: 43c73221b3b1 ("ext4: replace BUG_ON with WARN_ON in mb_find_extent()")
Cc: <stable@vger.kernel.org> # 4.14+
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20210623232114.34457-1-stephen.s.brennan@oracle.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a8867f4e 12-Apr-2021 Phillip Potter <phil@philpotter.co.uk>

ext4: fix memory leak in ext4_mb_init_backend on error path.

Fix a memory leak discovered by syzbot when a file system is corrupted
with an illegally large s_log_groups_per_flex.

Reported-by: syzbot+aa12d6106ea4ca1b6aae@syzkaller.appspotmail.com
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20210412073837.1686-1-phil@philpotter.co.uk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 666245d9 08-Apr-2021 Jack Qiu <jack.qiu@huawei.com>

ext4: fix trailing whitespace

Made suggested modifications from checkpatch in reference to ERROR:
trailing whitespace

Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20210409042035.15516-1-jack.qiu@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f68f4063 01-Apr-2021 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: add proc files to monitor new structures

This patch adds a new file "mb_structs_summary" which allows us to see
the summary of the new allocator structures added in this
series. Here's a sample output of the file:

optimize_scan: 1
max_free_order_lists:
    list_order_0_groups: 0
    list_order_1_groups: 0
    list_order_2_groups: 0
    list_order_3_groups: 0
    list_order_4_groups: 0
    list_order_5_groups: 0
    list_order_6_groups: 0
    list_order_7_groups: 0
    list_order_8_groups: 0
    list_order_9_groups: 0
    list_order_10_groups: 0
    list_order_11_groups: 0
    list_order_12_groups: 0
    list_order_13_groups: 40
fragment_size_tree:
    tree_min: 16384
    tree_max: 32768
    tree_nodes: 40

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-7-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 196e402a 01-Apr-2021 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: improve cr 0 / cr 1 group scanning

Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fallback to cr 1
immediately.

At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain an rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).

For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fallback to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.

There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.

All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
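
As a stand-alone toy model of the cr 0 structure (not kernel code; the
number of orders and the group ids are made up), the per-order lists make
"find a group whose largest free chunk is at least order k" a constant
number of list lookups:

    #include <stdio.h>

    #define NUM_ORDERS 14

    struct group { int id; struct group *next; };

    static struct group *order_lists[NUM_ORDERS];  /* one list per largest free order */

    static void add_group(struct group *g, int largest_free_order)
    {
            g->next = order_lists[largest_free_order];
            order_lists[largest_free_order] = g;
    }

    static struct group *pick_group(int request_order)
    {
            for (int o = request_order; o < NUM_ORDERS; o++)
                    if (order_lists[o])
                            return order_lists[o];
            return NULL;    /* nothing suitable at cr 0: fall back to cr 1 */
    }

    int main(void)
    {
            struct group g1 = { .id = 7 }, g2 = { .id = 42 };

            add_group(&g1, 5);      /* group 7: largest free chunk of order 5   */
            add_group(&g2, 13);     /* group 42: largest free chunk of order 13 */

            struct group *g = pick_group(9);     /* need at least a 2^9 chunk   */
            printf("picked group %d\n", g ? g->id : -1);    /* prints 42        */
            return 0;
    }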

With this patchset, the following experiment was performed:

Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. The following command was run 3 times
consecutively:

time dd if=/dev/urandom of=file bs=2M count=10

Here are the results with and without cr 0/1 optimizations introduced
in this patch:

|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4b68f6df 01-Apr-2021 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: add MB_NUM_ORDERS macro

A few arrays in mballoc.c use the total number of valid orders as
their size. Currently, this value is set as "sb->s_blocksize_bits +
2". This makes code harder to read. So, instead add a new macro
MB_NUM_ORDERS(sb) to make the code more readable.
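
That is, something along the lines of (a sketch):

    #define MB_NUM_ORDERS(sb)       ((sb)->s_blocksize_bits + 2)

so array sizes and loop bounds can say MB_NUM_ORDERS(sb) instead of the
bare "sb->s_blocksize_bits + 2".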

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a6c75eaf 01-Apr-2021 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: add mballoc stats proc file

Add new stats for measuring the performance of mballoc. This patch is
forked from Artem Blagodarenko's work that can be found here:

https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch

This patch reorganizes the stats by cr level. This is what the output
looks like:

mballoc:
    reqs: 0
    success: 0
    groups_scanned: 0
    cr0_stats:
        hits: 0
        groups_considered: 0
        useless_loops: 0
        bad_suggestions: 0
    cr1_stats:
        hits: 0
        groups_considered: 0
        useless_loops: 0
        bad_suggestions: 0
    cr2_stats:
        hits: 0
        groups_considered: 0
        useless_loops: 0
    cr3_stats:
        hits: 0
        groups_considered: 0
        useless_loops: 0
    extents_scanned: 0
    goal_hits: 0
    2^n_hits: 0
    breaks: 0
    lost: 0
    buddies_generated: 0/40
    buddies_time_used: 0
    preallocated: 0
    discarded: 0

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 67d25186 01-Apr-2021 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: drop s_mb_bal_lock and convert protected fields to atomic

s_mb_buddies_generated gets used later in this patch series to
determine whether the cr 0 and cr 1 optimizations should be performed.
Currently, s_mb_buddies_generated is protected by a spin_lock. In the
allocation path, it is better if we don't depend on the lock and instead
read the value atomically. In order to do that, we drop s_bal_lock
altogether and convert the only two fields it protected,
s_mb_buddies_generated and s_mb_generation_time, to atomic
types.
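
The conversion is essentially of this form (a sketch; the 64-bit atomic
for the time accumulator is an assumption):

    -       spin_lock(&sbi->s_bal_lock);
    -       sbi->s_mb_buddies_generated++;
    -       sbi->s_mb_generation_time += period;
    -       spin_unlock(&sbi->s_bal_lock);
    +       atomic_inc(&sbi->s_mb_buddies_generated);
    +       atomic64_add(period, &sbi->s_mb_generation_time);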

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f91436d5 24-Feb-2021 Sabyrzhan Tasbolatov <snovitoll@gmail.com>

fs/ext4: fix integer overflow in s_log_groups_per_flex

syzbot found a UBSAN shift-out-of-bounds in ext4_mb_init [1] when
1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX, where
sbi->s_mb_prefetch is of unsigned integer type.

32 is the maximum allowed power of s_log_groups_per_flex. The following
if check would also trigger a UBSAN shift-out-of-bounds:

    if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) {

So I'm checking it against the raw number; perhaps there is another way
to calculate the maximum power that fits in UINT_MAX. Also use min_t to
make sure the result is of uint type.
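
A stand-alone illustration of why the exponent has to be range-checked
before any shift happens (model code, not the kernel fix):

    #include <stdio.h>

    int main(void)
    {
            unsigned int log_groups_per_flex = 60;  /* hostile on-disk value */

            /* Shifting first and only then comparing against UINT_MAX is
             * already undefined behaviour once the exponent reaches the
             * type width, so the raw exponent must be validated first. */
            if (log_groups_per_flex >= 32) {
                    fprintf(stderr, "invalid s_log_groups_per_flex: %u\n",
                            log_groups_per_flex);
                    return 1;
            }
            printf("groups per flex: %u\n", 1U << log_groups_per_flex);
            return 0;
    }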

[1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24
shift exponent 60 is too large for 32-bit type 'int'
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x137/0x1be lib/dump_stack.c:120
ubsan_epilogue lib/ubsan.c:148 [inline]
__ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline]
ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898
ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983

Reported-by: syzbot+a8b4b0c60155e87e9484@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210224095800.3350002-1-snovitoll@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 82ef1370 03-Dec-2020 Chunguang Xu <brookxu@tencent.com>

ext4: avoid s_mb_prefetch to be zero in individual scenarios

Commit cfd732377221 ("ext4: add prefetching for block allocation
bitmaps") introduced block bitmap prefetch, and expects to read block
bitmaps of flex_bg through an IO. However, it seems to ignore the
value range of s_log_groups_per_flex. In the scenario where the value
of s_log_groups_per_flex is greater than 27, s_mb_prefetch or
s_mb_prefetch_limit will overflow, causing a divide-by-zero exception.

In addition, the logic of calculating nr is also flawed, because the
size of flexbg is fixed during a single mount, but s_mb_prefetch can
be modified, which causes nr to fail to meet the value condition of
[1, flexbg_size].

To solve this problem, we need to set the upper limit of
s_mb_prefetch. Since we expect to load block bitmaps of a flex_bg
through an IO, we can consider determining a reasonable upper limit
among the IO limit parameters. After consideration, we chose
BLK_MAX_SEGMENT_SIZE. This is a good choice that solves the
divide-by-zero problem while avoiding performance degradation.

[ Some minor code simplifications to make the changes easy to follow -- TYT ]

Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Samuel Liao <samuelliao@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1607051143-24508-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# cca41553 07-Nov-2020 Chunguang Xu <brookxu@tencent.com>

ext4: fix a memory leak of ext4_free_data

When freeing metadata, we will create an ext4_free_data and
insert it into the pending free list. After the current
transaction is committed, the object will be freed.

ext4_mb_free_metadata() will check whether the area to be freed
overlaps with the pending free list. If it does, it returns directly,
and at that point the ext4_free_data is leaked. Fortunately, the
probability of this problem is small, since it only occurs if the file
system is corrupted such that a block is claimed by more than one inode
and those inodes are deleted within a single jbd2 transaction.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1604764698-4269-8-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# ce3cca33 07-Nov-2020 Chunguang Xu <brookxu@tencent.com>

ext4: simplify the code of mb_find_order_for_block

The code of mb_find_order_for_block is a bit obscure, but we can
simplify it with mb_find_buddy(), making the code more concise.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-3-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 6bd97bf2 07-Nov-2020 Chunguang Xu <brookxu@tencent.com>

ext4: remove redundant mb_regenerate_buddy()

After commit 163a203, if an abnormal bitmap is detected, we mark the
group as corrupt and will not use this group in the future. Therefore,
regenerating the buddy bitmap of this group is meaningless, and it is
better to delete the call.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9b5f6c9b 05-Nov-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: make s_mount_flags modifications atomic

Fast commit file system states are recorded in
sbi->s_mount_flags. Fast commit expects these bit manipulations to be
atomic. This patch adds helpers to make those modifications atomic.
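
Sketch of the kind of helpers this adds (assuming s_mount_flags becomes
an unsigned long used as a bit field; names mirror the usual ext4 style):

    static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
    {
            set_bit(bit, &EXT4_SB(sb)->s_mount_flags);
    }

    static inline int ext4_test_mount_flag(struct super_block *sb, int bit)
    {
            return test_bit(bit, &EXT4_SB(sb)->s_mount_flags);
    }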

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-21-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e121bd48 30-Oct-2020 Dan Carpenter <dan.carpenter@oracle.com>

ext4: silence an uninitialized variable warning

Smatch complains that "i" can be uninitialized if we don't enter the
loop. I don't know if it's possible but we may as well silence this
warning.

[ Initialize i to sb->s_blocksize instead of 0. The only way the for
loop could be skipped entirely is the in-memory data structures, in
particular the bh->b_data for the on-disk superblock has gotten
corrupted enough that calculated value of group is >= to
ext4_get_groups_count(sb). In that case, we want to exit
immediately without allocating a block. -- TYT ]

Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20201030114620.GB3251003@mwanda
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 8016e29f 15-Oct-2020 Harshad Shirwadkar <harshadshirwadkar@gmail.com>

ext4: fast commit recovery path

This patch adds fast commit recovery path support for Ext4 file
system. We add several helper functions that are similar in spirit to
e2fsprogs journal recovery path handlers. Example of such functions
include - a simple block allocator, idempotent block bitmap update
function etc. Using these routines and the fast commit log in the fast
commit area, the recovery path (ext4_fc_replay()) performs fast commit
log recovery.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# addd752c 28-Sep-2020 Chunguang Xu <brookxu@tencent.com>

ext4: make mb_check_counter per group

Make bb_check_counter per group, so each group has the same chance
to be checked, which can expose errors more easily.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1601292995-32205-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9d1f9b27 28-Sep-2020 Chunguang Xu <brookxu@tencent.com>

ext4: delete invalid comments near mb_buddy_adjust_border

The comment near mb_buddy_adjust_border seems meaningless, just
clear it.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1601292995-32205-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b483bb77 04-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

ext4: delete duplicated words + other fixes

Delete repeated words in fs/ext4/.
{the, this, of, we, after}

Also change spelling of "xttr" in inline.c to "xattr" in 2 places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5b3dc19d 24-Sep-2020 Jan Kara <jack@suse.cz>

ext4: discard preallocations before releasing group lock

ext4_mb_discard_group_preallocations() can be releasing group lock with
preallocations accumulated on its local list. Thus although
discard_pa_seq was incremented and concurrent allocating processes will
be retrying allocations, it can happen that premature ENOSPC error is
returned because blocks used for preallocations are not available for
reuse yet. Make sure we always free locally accumulated preallocations
before releasing group lock.

Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924150959.4335-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 70022da8 16-Sep-2020 Ye Bin <yebin10@huawei.com>

ext4: fix dead loop in ext4_mb_new_blocks

While testing disk offline/online with fsstress running, we found that
the fsstress process keeps running forever.
kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
....
kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114

ext4_mb_new_blocks
repeat:
    ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)
        freed = ext4_mb_discard_preallocations
            ext4_mb_discard_group_preallocations
                this_cpu_inc(discard_pa_seq);
        ---> freed == 0
        seq_retry = ext4_get_discard_pa_seq_sum
            for_each_possible_cpu(__cpu)
                __seq += per_cpu(discard_pa_seq, __cpu);
        if (seq_retry != *seq) {
            *seq = seq_retry;
            ret = true;
        }

As we can see, seq_retry is the sum of discard_pa_seq over every cpu;
even if ext4_mb_discard_group_preallocations() returns zero,
discard_pa_seq on this cpu may have been incremented by one, so the
condition "seq_retry != *seq" is always met.
Ritesh Harjani suggested that in ext4_mb_discard_group_preallocations() we
only increase discard_pa_seq when there is some PA to free.
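
In other words, roughly (a sketch of the suggested change):

    ext4_lock_group(sb, group);
    if (!list_empty(&grp->bb_prealloc_list))
            this_cpu_inc(discard_pa_seq);   /* bump only when a PA can be freed */
    /* the rest of the group discard proceeds as before */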

Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200916113859.1556397-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 27bc446e 17-Aug-2020 brookxu <brookxu.cn@gmail.com>

ext4: limit the length of per-inode prealloc list

In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.

After patching, we observed that the sys ratio of cpu has dropped, and
the system throughput has increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:

Running time on unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m2.051s
user 0m0.008s
sys 0m2.026s

Running time on fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m0.471s
user 0m0.004s
sys 0m0.395s

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 66d5e027 17-Aug-2020 brookxu <brookxu.cn@gmail.com>

ext4: reorganize if statement of ext4_mb_release_context()

Reorganize the if statement of ext4_mb_release_context(), make it
easier to read.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/5439ac6f-db79-ad68-76c1-a4dda9aa0cc3@gmail.com
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c55ee7d2 14-Aug-2020 brookxu <brookxu.cn@gmail.com>

ext4: add mb_debug logging when there are lost chunks

Lost chunks are when some other process raced with the current thread
to grab a particular block allocation. Add mb_debug log for
developers who want to see how often this is happening for a
particular workload.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/0a165ac0-1912-aebd-8a0d-b42e7cd1aea1@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e0d438c7 09-Aug-2020 Xu Wang <vulab@iscas.ac.cn>

mballoc: replace seq_printf with seq_puts

seq_puts is a lot cheaper than seq_printf, so use that to print
literal strings.
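
That is, changes of this form (an illustrative one-liner):

    -       seq_printf(seq, "mballoc:\n");
    +       seq_puts(seq, "mballoc:\n");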

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200810022158.9167-1-vulab@iscas.ac.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# dddcd2f9 07-Aug-2020 brookxu <brookxu.cn@gmail.com>

ext4: optimize the implementation of ext4_mb_good_group()

It might be better to adjust the code in two places:
1. Determining whether grp is corrupt or not should be placed first.
2. The (cr <= 2 && free < ac->ac_g_ex.fe_len) check really belongs to the
   per-cr strategy, so it may be more appropriate to put it in the
   subsequent switch statement block. For cr1 and cr2, the conditions in
   the switch already implicitly realize the above check. For cr0, we
   should add the (free < ac->ac_g_ex.fe_len) check and then delete
   ((free / fragments) >= ac->ac_g_ex.fe_len), because cr0 returns true
   by default.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/e20b2d8f-1154-adb7-3831-a9e11ba842e9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 051e2ce8 07-Aug-2020 brookxu <brookxu.cn@gmail.com>

ext4: delete invalid comments near ext4_mb_check_limits()

These comments do not seem to be related to ext4_mb_check_limits();
they may be invalid.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c49faf0c-d5d5-9c51-6911-9e0ff57c6bfa@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e9a3cd48 07-Aug-2020 brookxu <brookxu.cn@gmail.com>

ext4: fix typos in ext4_mb_regular_allocator() comment

Fix typos in ext4_mb_regular_allocator() comment

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/d6514145-73b3-808b-ec5a-a8be27c51f9c@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ce9f24cc 28-Jul-2020 Jan Kara <jack@suse.cz>

ext4: check journal inode extents more carefully

Currently, system zones just track ranges of block, that are "important"
fs metadata (bitmaps, group descriptors, journal blocks, etc.). This
however complicates how extent tree (or indirect blocks) can be checked
for inodes that actually track such metadata - currently the journal
inode but arguably we should be treating quota files or resize inode
similarly. We cannot run __ext4_ext_check() on such metadata inodes when
loading their extents as that would immediately trigger the validity
checks and so we just hack around that and special-case the journal
inode. This however leads to a situation that a journal inode which has
extent tree of depth at least one can have invalid extent tree that gets
unnoticed until ext4_cache_extents() crashes.

To overcome this limitation, track inode number each system zone belongs
to (0 is used for zones not belonging to any inode). We can then verify
inode number matches the expected one when verifying extent tree and
thus avoid the false errors. With this there's no need to
special-case the journal inode during extent tree checking anymore, so
remove it.

Fixes: 0a944e8a6c66 ("ext4: don't perform block validity checks on the journal inode")
Reported-by: Wolfgang Frisch <wolfgang.frisch@suse.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9375ac77 26-Jul-2020 brookxu <brookxu.cn@gmail.com>

ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()

Delete the invalid BUG_ON in ext4_mb_load_buddy_gfp(); the preceding
code has already checked whether page is NULL.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/ad68e8a2-5ec3-5beb-537f-f3e53f55367a@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3d392b26 16-Jul-2020 Theodore Ts'o <tytso@mit.edu>

ext4: add prefetch_block_bitmaps mount option

For file systems where we can afford to keep the buddy bitmaps cached,
we can speed up initial writes to large file systems by starting to
load the block allocation bitmaps as soon as the file system is
mounted. This won't work well for _super_ large file systems, or
memory constrained systems, so we only enable this when it is
requested via a mount option.

Addresses-Google-Bug: 159488342
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>


# c1d2c7d4 19-Jun-2020 Alex Zhuravlev <azhuravlev@whamcloud.com>

ext4: skip non-loaded groups at cr=0/1 when scanning for good groups

cr=0 is supposed to be an optimization to save CPU cycles, but if
buddy data (in memory) is not initialized then all this makes no sense
as we have to do sync IO taking a lot of cycles. Also, at cr=0
mballoc doesn't choose any available chunk. cr=1 also skips groups
using heuristic based on avg. fragment size. It's more useful to skip
such groups and switch to cr=2 where groups will be scanned for
available chunks. However, we always read the first block group in a
flex_bg so metadata blocks will get read into the first flex_bg if
possible.

A 120TB filesystem was simulated using a sparse image on a dm-slow
virtual device; the image was then formatted and filled using debugfs
to mark ~85% of the available space as busy. The mount process without
the patch couldn't complete in half an hour (according to vmstat it
would take ~10-11 hours). With the patch applied, mount took ~20
seconds.

Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988
Signed-off-by: Alex Zhuravlev <azhuravlev@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>


# cfd73237 21-Apr-2020 Alex Zhuravlev <bzzz@whamcloud.com>

ext4: add prefetching for block allocation bitmaps

This should significantly improve bitmap loading, especially for flex
groups, as it tries to load all bitmaps within a flex group instead of
one by one synchronously.

Prefetching is done in 8 * flex_bg groups, so it should be 8
read-ahead reads for a single allocating thread. At the end of
allocation the thread waits for read-ahead completion and initializes
buddy information so that read-aheads are not lost in case of memory
pressure.

At cr=0 the number of prefetching IOs is limited per allocation
context to prevent a situation when mballoc loads thousands of bitmaps
looking for a perfect group and ignoring groups with good chunks.

Together with the patch "ext4: limit scanning of uninitialized groups"
the mount time (which includes few tiny allocations) of a 1PB
filesystem is reduced significantly:

                0% full   50%-full unpatched   patched
  mount time        33s                9279s      563s

[ Restructured by tytso; removed the state flags in the allocation
context, so it can be used to lazily prefetch the allocation bitmaps
immediately after the file system is mounted. Skip prefetching
block groups which are uninitialized. Finally pass in the
REQ_RAHEAD flag to the block layer while prefetching. ]

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 3cb77bd2 14-Jul-2020 brookxu <brookxu.cn@gmail.com>

ext4: fix spelling typos in ext4_mb_initialize_context

Fix spelling typos in ext4_mb_initialize_context.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/883b523c-58ec-7f38-0bb8-cd2ea4393684@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 81198536 09-Jun-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr

Simplify reading a seq variable by directly using this_cpu_read API
instead of doing this_cpu_ptr and then dereferencing it.
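
Concretely (a sketch of the conversion):

    -       seq = *this_cpu_ptr(&discard_pa_seq);
    +       seq = this_cpu_read(discard_pa_seq);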

This also avoids the below kernel BUG, which happens when
CONFIG_DEBUG_PREEMPT is enabled:

BUG: using smp_processor_id() in preemptible [00000000] code: syz-fuzzer/6927
caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x18f/0x20d lib/dump_stack.c:118
check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48
ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244
ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626
ext4_getblk+0xad/0x520 fs/ext4/inode.c:833
ext4_bread+0x7c/0x380 fs/ext4/inode.c:883
ext4_append+0x153/0x360 fs/ext4/namei.c:67
ext4_init_new_dir fs/ext4/namei.c:2757 [inline]
ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802
vfs_mkdir+0x419/0x690 fs/namei.c:3632
do_mkdirat+0x21e/0x280 fs/namei.c:3655
do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 42f56b7a4a7d ("ext4: mballoc: introduce pcpu seqcnt for freeing PA
to improve ENOSPC handling")
Suggested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: syzbot+82f324bb69744c5f6969@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/534f275016296996f54ecf65168bb3392b6f653d.1591699601.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 99377830 19-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: use lock for checking free blocks while retrying

Currently, while doing block allocation, grp->bb_free may be getting
modified if a discard is happening in parallel.
For example, consider a case where there are a lot of threads which have
preallocated a lot of blocks and there is a thread which is trying
to discard all of this group's PAs. Now it could happen that
we see all of those groups' bb_free as zero and fail the allocation
while there would be sufficient space if we freed up all the PAs.

So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set
if we are unable to allocate any blocks in the first try (since we may
not have considered blocks about to be discarded from PA lists).
So during the retry attempt to allocate blocks we will use ext4_lock_group()
for checking if the group is good or not.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8ef123fe 19-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: refactor ext4_mb_good_group()

ext4_mb_good_group() definition was changed some time back
and now it even initializes the buddy cache (via ext4_mb_init_group()),
if in case the EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so should not be called
under a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held
and again called after loading the bitmap, only this time with
ext4_lock_group() held.
But still this whole thing is confusing.

So this patch refactors out ext4_mb_good_group_nolock() which should be
called when without holding ext4_lock_group().
Also in further patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involves grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 07b5b8e1 19-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.

Consider a case where we have only a small number of groups
(e.g. just group 0):
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(which is where we call ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the 1st
thread to release all the PAs and should then have retried the
allocation, since the 1st thread was going to discard the PAs anyway.

The algorithm using this percpu seq counter goes as follows:
1. We sample the percpu discard_pa_seq counter before trying for block
allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
or free these blocks i.e. while marking those blocks as used/free in
mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
that the bb_prealloc_list is not empty and hence proceed for discarding
of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
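
Putting steps 1-3 together, the retry decision (implemented by
ext4_mb_discard_preallocations_should_retry()) looks roughly like this
simplified sketch, with allocation-context details and flags omitted:

    static bool should_retry(struct super_block *sb, int needed, u64 *seq)
    {
            u64 seq_retry;

            /* any PA actually freed already bumped discard_pa_seq */
            if (ext4_mb_discard_preallocations(sb, needed) > 0)
                    return true;

            /* nothing freed: only now pay for summing over all cpus */
            seq_retry = ext4_get_discard_pa_seq_sum();
            if (seq_retry != *seq) {
                    *seq = seq_retry;       /* someone made progress; retry */
                    return true;
            }
            return false;                   /* genuine ENOSPC */
    }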

It can well be argued: why not just check grp->bb_free to see if there
are any free blocks to be allocated? So here are the two concerns which
were discussed:-

1. If for some reason the blocks available in the group are not
appropriate for allocation logic (say for e.g.
EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
the retry logic may result in infinite looping since grp->bb_free is
non-zero.

2. Also, before preallocation was clubbed with block allocation under the
same ext4_lock_group(), there were a lot of races where grp->bb_free
could not be relied upon.
Due to the above, this patch uses the discard_pa_seq logic to determine if
we should retry the block allocation. Say there are n threads
trying for block allocation and none of those could allocate or discard
any of the blocks, then all of those n threads will fail the block
allocation and return -ENOSPC error. (Since the seq counter for all of
those will match as no block allocation/discard was done during that
duration).

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# cf5e2ca6 19-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: refactor ext4_mb_discard_preallocations()

Implement ext4_mb_discard_preallocations_should_retry()
which we will need in later patches to add more logic
like check for sequence number match to see if we should
retry for block allocation or not.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 53f86b17 19-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: add blocks to PA list under same spinlock after allocating blocks

ext4_mb_discard_preallocations() only checks grp->bb_prealloc_list of
every group to discard the group's PAs and free up space if an
allocation request fails. Consider the race below:

Process A:
1. Allocates blocks:
       ext4_lock_group()
         allocate blocks (more than ac_o_ex.fe_len)
       ext4_unlock_group()
2. Adds the additional blocks to the PA list:
       ext4_lock_group()
         add blocks to grp->bb_prealloc_list
       ext4_unlock_group()

Process B (running between A's steps 1 and 2):
1. Fails block allocation from ext4_mb_regular_allocator().
2. Scans the grp->bb_prealloc_list (under ext4_lock_group()),
   finds nothing and thus returns -ENOSPC.

The above race could be avoided if we added those additional blocks to
grp->bb_prealloc_list at the same time as the block allocation, while
ext4_lock_group() is still held.
With this, discard-PA will know if there are actually any blocks which
could be freed from the PA.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d3df1453 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: make mb_debug() implementation to use pr_debug()

mb_debug() msgs had only one control level for all types of msgs,
and if we enabled mballoc_debug then all of those msgs would be enabled.
Instead of adding multiple debug levels for mb_debug() msgs, use
pr_debug(), which gives finer control to print msgs at all of the
different levels (i.e. per file, function, or line number).

Also add the process name/pid, superblock id, and other info to the
mb_debug() msg. This also kills the mballoc_debug module parameter,
since it is not needed any more.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
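
A hedged sketch of what such a pr_debug()-based wrapper can look like
(the exact mainline macro differs in detail):

    #include <linux/fs.h>
    #include <linux/printk.h>
    #include <linux/sched.h>

    /*
     * With CONFIG_DYNAMIC_DEBUG, every call site of a pr_debug()-based
     * macro can be switched on individually by file, function or line
     * through the dynamic_debug control file.
     */
    #define mb_debug_sketch(sb, fmt, ...)                                \
            pr_debug("(%s, pid %d): %s: " fmt,                           \
                     (sb)->s_id, current->pid, __func__, ##__VA_ARGS__)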


# eb2b8ebb 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECK

Make sure to check for e4b->bd_info->bb_bitmap == NULL, in
mb_cmp_bitmaps() and return if NULL, to avoid possible NULL ptr
dereference. Similar to how we do this in other ifdef DOUBLE_CHECK
functions.

Also remove the BUG_ON() logic if kmalloc() or ext4_read_block_bitmap()
fails. We should simply mark grp->bb_bitmap as NULL if the above happens.
In fact ext4_read_block_bitmap() may even return an error in the case of
the resize ioctl. Hence remove this BUG_ON logic (fstests ext4/032 may
trigger this).

Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
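
A minimal sketch of the defensive pattern described above; it assumes the
fs/ext4 internal headers, and the function body is illustrative:

    #ifdef DOUBLE_CHECK
    static void mb_cmp_bitmaps_sketch(struct ext4_buddy *e4b, void *bitmap)
    {
            /* bb_bitmap is left NULL if its allocation or the block
             * bitmap read failed, so skip the extra verification. */
            if (e4b->bd_info->bb_bitmap == NULL)
                    return;

            /* ... compare the passed-in bitmap against the shadow copy ... */
    }
    #endif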


# a3450215 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: refactor code inside DOUBLE_CHECK into separate function

This patch implements mb_group_bb_bitmap_alloc() and
mb_group_bb_bitmap_free() to remove the #ifdef DOUBLE_CHECK macro
and its related code from inside
ext4_mb_add_groupinfo()/ext4_mb_release().

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8c2095d74b779f0254a19b24982490dc6f07c4f9.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4fca8f07 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: make ext4_mb_use_preallocated() return type as bool

Change return type of function ext4_mb_use_preallocated() to bool to
better reflect what this function can return.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7880cb6ef911465beafefcd7e9c3ea214688744b.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f283529a 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: simplify error handling in ext4_init_mballoc()

This patch simplifies error handling logic in ext4_init_mballoc(),
by adding all the cleanups at one place at the end of that function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8621a7bc68f7107a9ac4292afeb784515333bd25.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 004379d0 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: fix few other format specifier in mb_debug()

Fix a few other format specifiers in mb_debug() messages.
As such, there is no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/574fa7f833abf2dbf3b53a2fea3195e71f6cdbd8.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 36bad423 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: correct the mb_debug() format specifier for pa_len var

pa->pa_len is an integer. Fix all of the format specifiers used in
mb_debug() for pa_len to %d instead of %u.

As such, there is no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/af4987f643c586f62bcc9961e43f0a67151d5551.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# bbc4ec77 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: add more mb_debug() msgs

This patch adds some more mb_debug() messages to help improve
mballoc code debugging.
Other than adding more mb_debug() messages in a few more places,
there should be no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5fc8e7788b924e211fcfa4a4c1d2f8503511661a.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e68cf40c 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: refactor ext4_mb_show_ac()

This factors out an ext4_mb_show_pa() function to show all of the group's
preallocation info. This could be useful info to add in later
patches.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8f07d890b0038dcc935e9c10e6043ec9f3792721.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 212da3ec 10-May-2020 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: mballoc: print bb_free info even when it is 0

Improve the debugging message by printing it even when bb_free is 0.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c894f1d1d30f86ae38f4e3a861949665b6dc61cd.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 907ea529 13-Apr-2020 Theodore Ts'o <tytso@mit.edu>

ext4: convert BUG_ON's to WARN_ON's in mballoc.c

If the in-core buddy bitmap gets corrupted (or out of sync with the
block bitmap), issue a WARN_ON and try to recover. In most cases this
involves skipping trying to allocate out of a particular block group.
We can end up declaring the file system corrupted, which is fair,
since the file system probably should be checked before we proceed any
further.

Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu
Google-Bug-Id: 34811296
Google-Bug-Id: 34639169
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 54d3adbc 28-Mar-2020 Theodore Ts'o <tytso@mit.edu>

ext4: save all error info in save_error_info() and drop ext4_set_errno()

Using a separate function, ext4_set_errno() to set the errno is
problematic because it doesn't do the right thing once
s_last_error_errorcode is non-zero. It's also less racy to set all of
the error information all at once. (Also, as a bonus, it shrinks code
size slightly.)

Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
Fixes: 878520ac45f9 ("ext4: save the error code which triggered...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# eb576086 10-Mar-2020 Dmitry Monakhov <dmonakhov@gmail.com>

ext4: mark block bitmap corrupted when found instead of BUGON

We already have similar code in ext4_mb_complex_scan_group(), but
ext4_mb_simple_scan_group() is still affected.

Other reports: https://www.spinics.net/lists/linux-ext4/msg60231.html

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Link: https://lore.kernel.org/r/20200310150156.641-1-dmonakhov@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 92e9c58c 13-Feb-2020 Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>

ext4: use built-in RCU list checking in mballoc

list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass the cond argument to list_for_each_entry_rcu() to silence a
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Link: https://lore.kernel.org/r/20200213152558.7070-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
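
A self-contained sketch of passing the lockdep condition; the structure
and lock below are illustrative, and the mballoc call sites pass their own:

    #include <linux/rculist.h>
    #include <linux/spinlock.h>

    struct pa_entry {
            struct list_head list;
    };

    static void walk_pa_list(struct list_head *head, spinlock_t *lock)
    {
            struct pa_entry *pa;

            spin_lock(lock);
            /* The fourth argument tells CONFIG_PROVE_RCU_LIST that holding
             * this spinlock is a legitimate way to traverse the list. */
            list_for_each_entry_rcu(pa, head, list, lockdep_is_held(lock)) {
                    /* ... inspect pa ... */
            }
            spin_unlock(lock);
    }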


# 7c990728 18-Feb-2020 Suraj Jitindar Singh <surajjs@amazon.com>

ext4: fix potential race between s_flex_groups online resizing and access

During an online resize an array of s_flex_groups structures gets replaced
so it can get enlarged. If there is a concurrent access to the array and
this memory has been reused then this can lead to an invalid memory access.

The s_flex_group array has been converted into an array of pointers rather
than an array of structures. This is to ensure that the information
contained in the structures cannot get out of sync during a resize due to
an accessor updating the value in the old structure after it has been
copied but before the array pointer is updated. Since the structures
themselves are no longer copied but only the pointers to them, this case
is mitigated.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# df3da4ea 18-Feb-2020 Suraj Jitindar Singh <surajjs@amazon.com>

ext4: fix potential race between s_group_info online resizing and access

During an online resize an array of pointers to s_group_info gets replaced
so it can get enlarged. If there is a concurrent access to the array in
ext4_get_group_info() and this memory has been reused then this can lead to
an invalid memory access.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Balbir Singh <sblbir@amazon.com>
Cc: stable@kernel.org


# 878520ac 19-Nov-2019 Theodore Ts'o <tytso@mit.edu>

ext4: save the error code which triggered an ext4_error() in the superblock

This allows the cause of an ext4_error() report to be categorized
based on whether it was triggered due to an I/O error, or a memory
allocation error, or other possible causes. Most errors are caused by
a detected file system inconsistency, so the default code stored in
the superblock will be EXT4_ERR_EFSCORRUPTED.

Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c60990b3 19-Jun-2019 Theodore Ts'o <tytso@mit.edu>

ext4: clean up kerneldoc warnings when building with W=1

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4b99faa2 24-Apr-2019 Khazhismel Kumykov <khazhy@google.com>

ext4: cond_resched in work-heavy group loops

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>


# 31562b95 06-Apr-2019 Jan Kara <jack@suse.cz>

ext4: make sanity check in mballoc more strict

The sanity check in mb_find_extent() only checked that the returned extent
does not extend past blocksize * 8; however, it should not extend past
EXT4_CLUSTERS_PER_GROUP(sb). This can happen when clusters_per_group <
blocksize * 8 and the tail of the bitmap is not properly filled with 1s,
which happened e.g. when an ancient kernel grew the filesystem.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 82dd124c 10-Feb-2019 Nikolay Borisov <nborisov@suse.com>

ext4: replace opencoded i_writecount usage with inode_is_open_for_write()

There is a function which clearly conveys the objective of checking
i_writecount. Additionally, the usage in ext4_mb_initialize_context() was
wrong, since an inode would have wrongly been reported as writable if
i_writecount had a negative value (MMAP_DENY_WRITE).

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 9fe67149 01-Oct-2018 Eric Whitney <enwlinux@gmail.com>

ext4: adjust reserved cluster count when removing extents

Modify ext4_ext_remove_space() and the code it calls to correct the
reserved cluster count for pending reservations (delayed allocated
clusters shared with allocated blocks) when a block range is removed
from the extent tree. Pending reservations may be found for the clusters
at the ends of written or unwritten extents when a block range is removed.
If a physical cluster at the end of an extent is freed, it's necessary
to increment the reserved cluster count to maintain correct accounting
if the corresponding logical cluster is shared with at least one
delayed and unwritten extent as found in the extents status tree.

Add a new function, ext4_rereserve_cluster(), to reapply a reservation
on a delayed allocated cluster sharing blocks with a freed allocated
cluster. To avoid ENOSPC on reservation, a flag is applied to
ext4_free_blocks() to briefly defer updating the freeclusters counter
when an allocated cluster is freed. This prevents another thread
from allocating the freed block before the reservation can be reapplied.

Redefine the partial cluster object as a struct to carry more state
information and to clarify the code using it.

Adjust the conditional code structure in ext4_ext_remove_space to
reduce the indentation level in the main body of the code to improve
readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 863c37fc 04-Aug-2018 zhong jiang <zhongjiang@huawei.com>

ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()

The variable err is not used after initialization, so just remove it.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1a5d5e5d 01-Aug-2018 Jeremy Cline <jcline@redhat.com>

ext4: fix spectre gadget in ext4_mb_regular_allocator()

'ac->ac_g_ex.fe_len' is a user-controlled value which is used in the
derivation of 'ac->ac_2order'. 'ac->ac_2order', in turn, is used to
index arrays which makes it a potential spectre gadget. Fix this by
sanitizing the value assigned to 'ac->ac_2order'. This covers the
following accesses found with the help of smatch:

* fs/ext4/mballoc.c:1896 ext4_mb_simple_scan_group() warn: potential
spectre issue 'grp->bb_counters' [w] (local cap)

* fs/ext4/mballoc.c:445 mb_find_buddy() warn: potential spectre issue
'EXT4_SB(e4b->bd_sb)->s_mb_offsets' [r] (local cap)

* fs/ext4/mballoc.c:446 mb_find_buddy() warn: potential spectre issue
'EXT4_SB(e4b->bd_sb)->s_mb_maxs' [r] (local cap)

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
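
A hedged sketch of the sanitization pattern (the function and bound names
are illustrative):

    #include <linux/nospec.h>

    /*
     * Clamp a user-influenced order before it can be used, even
     * speculatively, as an array index.
     */
    static unsigned int sanitize_order(unsigned int order,
                                       unsigned int nr_orders)
    {
            if (order >= nr_orders)
                    return 0;
            return array_index_nospec(order, nr_orders);
    }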


# 8844618d 13-Jun-2018 Theodore Ts'o <tytso@mit.edu>

ext4: only look at the bg_flags field if it is valid

The bg_flags field in the block group descriptors is only valid if the
uninit_bg or metadata_csum feature is enabled. We were not
consistently looking at this field; fix this.

Also block group #0 must never have uninitialized allocation bitmaps,
or need to be zeroed, since that's where the root inode, and other
special inodes are set up. Check for these conditions and mark the
file system as corrupted if they are detected.

This addresses CVE-2018-10876.

https://bugzilla.kernel.org/show_bug.cgi?id=199403

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org


# 21c580d8 20-May-2018 Sean Fu <fxinrong@gmail.com>

ext4: remove NULL check before calling kmem_cache_destroy()

Signed-off-by: Sean Fu <fxinrong@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 247dbed8 11-Apr-2018 Christoph Hellwig <hch@lst.de>

ext4: simplify procfs code

Use remove_proc_subtree to remove the whole subtree on cleanup, and
unwind the registration loop into individual calls. Switch to use
proc_create_seq where applicable.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# 736dedbb 11-May-2018 Wang Shilong <wshilong@ddn.com>

ext4: mark block bitmap corrupted when found

There are still some cases where we missed setting the
block bitmap corrupted bit properly:

1) the block bitmap number is wrong.
2) we failed to read the block bitmap due to disk errors.
3) blocks are double-freed in the block bitmap.
4) some mismatch is found when checking bitmaps against buddy information.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>


# db79e6d1 12-May-2018 Wang Shilong <wshilong@ddn.com>

ext4: add new ext4_mark_group_bitmap_corrupted() helper

Since there are many places that set the inode/block bitmap
corrupt bit, add a new helper for it, which will make
the code clearer.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>


# 49598e04 11-Jan-2018 Jun Piao <piaojun@huawei.com>

ext4: use 'sbi' instead of 'EXT4_SB(sb)'

We could use 'sbi' instead of 'EXT4_SB(sb)' to make the code more elegant.

Signed-off-by: Jun Piao <piaojun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# f5166768 17-Dec-2017 Theodore Ts'o <tytso@mit.edu>

ext4: fix up remaining files with SPDX cleanups

A number of ext4 source files were skipped because their copyright
permission statements didn't match the expected text used by the
automated conversion utilities. I've added SPDX tags for the rest.

While looking at some of these files, I've noticed that we have quite
a bit of variation on the licenses that were used --- in particular
some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
and we have some files that have a LGPL-2.1 license (which was quite
surprising).

I've not attempted to do any license changes. Even if it is perfectly
legal to relicense to GPL 2.0-only for consistency's sake, that should
be done with ext4 developer community discussion.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d77147ff 29-Oct-2017 harshads <harshads@google.com>

ext4: add support for online resizing with bigalloc

This patch adds support for online resizing on bigalloc file system by
implementing EXT4_IOC_RESIZE_FS ioctl. Old resize interfaces (add
block groups and extend last block group) are left untouched. Tests
performed with cluster sizes of 1, 2, 4 and 8 blocks (of size 4k) per
cluster. I will add these tests to xfstests.

Signed-off-by: Harshad Shirwadkar <harshads@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b80b32b6 14-Aug-2017 Theodore Ts'o <tytso@mit.edu>

ext4: fix clang build regression

Arnd Bergmann <arnd@arndb.de>

As Stefan pointed out, I misremembered what clang can do specifically,
and it turns out that the variable-length array at the end of the
structure did not work (a flexible array would have worked here
but not solved the problem):

fs/ext4/mballoc.c:2303:17: error: fields must have a constant size:
'variable length array in structure' extension will never be supported
ext4_grpblk_t counters[blocksize_bits + 2];

This reverts part of my previous patch, using a fixed-size array
again, but keeping the check for the array overflow.

Fixes: 2df2c3402fc8 ("ext4: fix warning about stack corruption")
Reported-by: Stefan Agner <stefan@agner.ch>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 2df2c340 05-Aug-2017 Arnd Bergmann <arnd@arndb.de>

ext4: fix warning about stack corruption

After commit 62d1034f53e3 ("fortify: use WARN instead of BUG for now"),
we get a warning about possible stack overflow from a memcpy that
was not strictly bounded to the size of the local variable:

inlined from 'ext4_mb_seq_groups_show' at fs/ext4/mballoc.c:2322:2:
include/linux/string.h:309:9: error: '__builtin_memcpy': writing between 161 and 1116 bytes into a region of size 160 overflows the destination [-Werror=stringop-overflow=]

We actually had a bug here that would have been found by the warning,
but it was already fixed last year in commit 30a9d7afe70e ("ext4: fix
stack memory corruption with 64k block size").

This replaces the fixed-length structure on the stack with a variable-length
structure, using the correct upper bound that tells the compiler that
everything is really fine here. I also change the loop count to check
for the same upper bound for consistency, but the existing code is
already correct here.

Note that while clang won't allow certain kinds of variable-length arrays
in structures, this particular instance is fine, as the array is at the
end of the structure, and the size is strictly bounded.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# e4510577 05-Aug-2017 Daeho Jeong <daeho.jeong@samsung.com>

ext4: release discard bio after sending discard commands

We've changed the discard command handling to work in a parallel manner.
But in this change, I forgot to decrease the usage count of the bio
which was used to send the discard request. I'm sorry about that.

Fixes: a015434480dc ("ext4: send parallel discards on commit completions")
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# ff950156 06-Jul-2017 Colin Ian King <colin.king@canonical.com>

ext4: fix spelling mistake: "prellocated" -> "preallocated"

Trivial fix to spelling mistake in mb_debug debug message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a0154344 22-Jun-2017 Daeho Jeong <daeho.jeong@samsung.com>

ext4: send parallel discards on commit completions

Now, when we mount an ext4 filesystem with the '-o discard' option, we have
to issue all the discard commands for the blocks to be deallocated and
wait for the completion of the commands during the commit complete phase.
Because this procedure might involve a lot of sequential combinations of
issuing discard commands and waiting for them, the delay of this
procedure might be too long, even up to 17.0s in our test,
and it results in long commit delays and fsync() performance degradation.

To reduce this kind of delay, instead of adding a callback for each
extent and handling all of them in a sequential manner during the commit
phase, we instead add a separate list of extents to free to the superblock
and then process this list at once after the transaction commits, so that
we can issue all the discard commands in a parallel manner like the XFS
filesystem.

Finally, we could enhance the discard command handling performance.
The result was that the 17.0s delay of a single commit in the worst
case has been reduced to 4.8s.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Tested-by: Kitae Lee <kitae87.lee@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
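
A simplified sketch of the deferral half of the scheme (the structure
names are illustrative, and the asynchronous bio submission that makes
the discards parallel is omitted):

    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    struct freed_extent {
            struct list_head list;
            sector_t         start;
            sector_t         nr_blocks;
    };

    /* Called only after the transaction that freed the extents commits. */
    static void issue_deferred_discards(struct list_head *freed)
    {
            struct freed_extent *fe, *tmp;

            list_for_each_entry_safe(fe, tmp, freed, list) {
                    /* submit a discard for [fe->start, fe->start + fe->nr_blocks);
                     * submitting them all before waiting is what makes the
                     * discards run in parallel */
                    list_del(&fe->list);
                    kfree(fe);
            }
    }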


# 02749a4c 22-Jun-2017 Tahsin Erdogan <tahsin@google.com>

ext4: add ext4_is_quota_file()

IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4
also uses it to check whether an inode is for a quota file. The
distinction currently doesn't matter because quota is disabled only
for the quota files. When we start disabling quota for other inodes
in the future, we will want to make the distinction clear.

Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where
we are checking for quota files.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9651e6b2 21-May-2017 Konstantin Khlebnikov <koct9i@gmail.com>

ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors

I've got another report about ext4 being broken by an ENOMEM error returned
from ext4_mb_load_buddy(), caused by a memory shortage in a memory cgroup.
This time it is inside ext4_discard_preallocations().

This patch replaces ext4_error() with ext4_warning() where errors returned
from ext4_mb_load_buddy() are not fatal and handled by caller:
* ext4_mb_discard_group_preallocations() - called before generating ENOSPC,
we'll try to discard other group or return ENOSPC into user-space.
* ext4_trim_all_free() - just stop trimming and return ENOMEM from ioctl.

Some callers cannot handle errors, thus __GFP_NOFAIL is used for them:
* ext4_discard_preallocations()
* ext4_mb_discard_lg_preallocations()

Fixes: adb7ef600cc9 ("ext4: use __GFP_NOFAIL in ext4_free_blocks()")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# a7c3e901 08-May-2017 Michal Hocko <mhocko@suse.com>

mm: introduce kv[mz]alloc helpers

Patch series "kvmalloc", v5.

There are many open coded kmalloc with vmalloc fallback instances in the
tree. Most of them are not careful enough or simply do not care about
the underlying semantics of the kmalloc/page allocator, which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds, and b) the page allocator can
invoke really disruptive steps like the OOM killer to move forward,
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As can be seen, implementing kvmalloc requires quite intimate
knowledge of the page allocator and the memory reclaim internals, which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers I could find have been converted to use the helper
instead. This is patch 6. There are some more relying on __GFP_REPEAT
in the networking stack, which I have converted as well, and Eric Dumazet
was not opposed [2] to converting them as well.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure not to create
large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
not to warn about allocation failures. This also rules out the OOM
killer, as vmalloc is a more appropriate fallback than a disruptive
user-visible action.

This patch also changes some existing users and removes helpers which
are specific for them. In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL). Those need to be
fixed separately.

While we are at it, document in __vmalloc{_node} the unsupported gfp
mask, because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL-incompatible flags (which are not
a superset) to catch new abusers. Existing ones will have to die
out slowly.

[sfr@canb.auug.org.au: f2fs fixup]
Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
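
A brief usage sketch of the new helpers (the caller-side function names
are illustrative):

    #include <linux/mm.h>
    #include <linux/slab.h>

    /* kvzalloc() tries kmalloc() first and transparently falls back to
     * vzalloc() for large requests; kvfree() releases either kind. */
    static unsigned long *alloc_big_table(size_t nr_entries)
    {
            return kvzalloc(nr_entries * sizeof(unsigned long), GFP_KERNEL);
    }

    static void free_big_table(unsigned long *table)
    {
            kvfree(table);
    }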


# 0c9ec4be 29-Apr-2017 Darrick J. Wong <darrick.wong@oracle.com>

ext4: support GETFSMAP ioctls

Support the GETFSMAP ioctls so that we can use the xfs free space
management tools to probe ext4 as well. Note that this is a partial
implementation -- we only report fixed-location metadata and free space;
everything else is reported as "unknown".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d6006186 29-Apr-2017 Eric Biggers <ebiggers@google.com>

ext4: constify static data that is never modified

Constify static data in ext4 that is never (intentionally) modified so
that it is placed in .rodata and benefits from memory protection.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 93407472 27-Feb-2017 Fabian Frederick <fabf@skynet.be>

fs: add i_blocksize()

Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.

This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'

Thanks to Andrew Morton for suggesting the more appropriate function
instead of a macro.

[geliangtang@gmail.com: truncate: use i_blocksize()]
Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
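
A tiny sketch of the replacement (the wrapper name is illustrative):

    #include <linux/fs.h>

    /* i_blocksize(inode) replaces the open-coded 1 << inode->i_blkbits. */
    static inline u64 blocks_to_bytes(const struct inode *inode, u64 nr_blocks)
    {
            return nr_blocks * i_blocksize(inode);
    }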


# d9b22cf9 09-Feb-2017 Jan Kara <jack@suse.cz>

ext4: fix stripe-unaligned allocations

When a filesystem is created using:

mkfs.ext4 -b 4096 -E stride=512 <dev>

and we try to allocate 64MB extent, we will end up directly in
ext4_mb_complex_scan_group(). This is because the request is detected
as power-of-two allocation (so we start in ext4_mb_regular_allocator()
with ac_criteria == 0) however the check before
ext4_mb_simple_scan_group() refuses the direct buddy scan because the
allocation request is too large. Since cr == 0, the check whether we
should use ext4_mb_scan_aligned() fails as well and we fall back to
ext4_mb_complex_scan_group().

Fix the problem by checking for upper limit on power-of-two requests
directly when detecting them.

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# cd648b8a 27-Jan-2017 Jan Kara <jack@suse.cz>

ext4: trim allocation requests to group size

If filesystem groups are artificially small (using the -g parameter to
mkfs.ext4), ext4_mb_normalize_request() can result in a request that is
larger than a block group. Trim the request size so as not to confuse the
allocation code.

Reported-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 43c73221 22-Jan-2017 Theodore Ts'o <tytso@mit.edu>

ext4: replace BUG_ON with WARN_ON in mb_find_extent()

The last BUG_ON in mb_find_extent() is apparently triggering in some
rare cases. Most of the time it indicates a bug in the buddy bitmap
algorithms, but there are some weird cases where it can trigger when
buddy bitmap is still in memory, but the block bitmap has to be read
from disk, and there is disk or memory corruption such that the block
bitmap and the buddy bitmap are out of sync.

Google-Bug-Id: #33702157

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 30a9d7af 14-Nov-2016 Chandan Rajendra <chandan@linux.vnet.ibm.com>

ext4: fix stack memory corruption with 64k block size

The number of 'counters' elements needed in 'struct sg' is
super_block->s_blocksize_bits + 2. Presently we have 16 'counters'
elements in the array. This is insufficient for block sizes >= 32k. In
such cases the memcpy operation performed in ext4_mb_seq_groups_show()
would cause stack memory corruption.

Fixes: c9de560ded61f
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org


# 69e43e8c 14-Nov-2016 Chandan Rajendra <chandan@linux.vnet.ibm.com>

ext4: fix mballoc breakage with 64k block size

'border' variable is set to a value of 2 times the block size of the
underlying filesystem. With 64k block size, the resulting value won't
fit into a 16-bit variable. Hence this commit changes the data type of
'border' to 'unsigned int'.

Fixes: c9de560ded61f
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org


# 554a5ccc 14-Jul-2016 Vegard Nossum <vegard.nossum@oracle.com>

ext4: fix reference counting bug on block allocation error

If we hit this error when mounted with errors=continue or
errors=remount-ro:

EXT4-fs error (device loop0): ext4_mb_mark_diskspace_used:2940: comm ext4.exe: Allocating blocks 5090-6081 which overlap fs metadata

then ext4_mb_new_blocks() will call ext4_mb_release_context() and try to
continue. However, ext4_mb_release_context() is the wrong thing to call
here since we are still actually using the allocation context.

Instead, just error out. We could retry the allocation, but there is a
possibility of getting stuck in an infinite loop instead, so this seems
safer.

[ Fixed up so we don't return EAGAIN to userspace. --tytso ]

Fixes: 8556e8f3b6 ("ext4: Don't allow new groups to be added during block allocation")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org


# d08854f5 26-Jun-2016 Theodore Ts'o <tytso@mit.edu>

ext4: optimize ext4_should_retry_alloc() to improve ENOSPC performance

If there are no pending blocks to be released after a commit, forcing
a journal commit has no hope of helping. It's possible that a commit
had just completed, so if there are now free blocks available for
allocation, it's worth retrying the commit.

Reported-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 84c60b13 27-May-2016 Al Viro <viro@zeniv.linux.org.uk>

drop redundant ->owner initializations

it's not needed for file_operations of inodes located on fs defined
in the hosting module and for file_operations that go into procfs.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 935244cd 05-May-2016 Nicolai Stange <nicstange@gmail.com>

ext4: silence UBSAN in ext4_mb_init()

Currently, in ext4_mb_init(), there's a loop like the following:

	do {
		...
		offset += 1 << (sb->s_blocksize_bits - i);
		i++;
	} while (i <= sb->s_blocksize_bits + 1);

Note that the updated offset is used in the loop's next iteration only.

However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports

UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
shift exponent 4294967295 is too large for 32-bit type 'int'
[...]
Call Trace:
[<ffffffff818c4d25>] dump_stack+0xbc/0x117
[<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169
[<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e
[<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
[<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390
[<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0
[<ffffffff814293c7>] ? create_cache+0x57/0x1f0
[<ffffffff8142948a>] ? create_cache+0x11a/0x1f0
[<ffffffff821c2168>] ? mutex_lock+0x38/0x60
[<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50
[<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0
[<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0
[<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0
[...]

Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of a cosmetic nature only:
the value of offset calculated this way is never used again.

Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
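
A sketch of the shape of the fix; the variable names follow the
description above, while the surrounding ext4_mb_init() context is
omitted:

    static void walk_orders(unsigned int blocksize_bits)
    {
            unsigned int i = 1;
            unsigned long offset = 0;
            /* next increment to apply, halved every iteration instead of
             * recomputing a shift whose exponent can go out of range on
             * the final pass */
            unsigned long offset_incr = 1UL << (blocksize_bits - 1);

            do {
                    /* ... set up the buddy metadata for order i at offset ... */
                    offset += offset_incr;
                    offset_incr = offset_incr >> 1;
                    i++;
            } while (i <= blocksize_bits + 1);
    }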


# b5cb316c 05-May-2016 Nicolai Stange <nicstange@gmail.com>

ext4: address UBSAN warning in mb_find_order_for_block()

Currently, in mb_find_order_for_block(), there's a loop like the following:

	while (order <= e4b->bd_blkbits + 1) {
		...
		bb += 1 << (e4b->bd_blkbits - order);
	}

Note that the updated bb is used in the loop's next iteration only.

However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports

UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
shift exponent -1 is negative
[...]
Call Trace:
[<ffffffff818c4d35>] dump_stack+0xbc/0x117
[<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
[<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e
[<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
[<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590
[<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80
[<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240
[...]

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of a cosmetic nature only:
the value of bb calculated this way is never used again.

Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 8d2ae1cb 26-Apr-2016 Jakub Wilk <jwilk@jwilk.net>

ext4: remove trailing \n from ext4_warning/ext4_error calls

Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these functions add the newlines themselves.

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>


# ea1754a0 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage

Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.

This promise never materialized, and it is unlikely it ever will.

We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE, and it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files, so
I've called spatch for them manually.

The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# adb7ef60 13-Mar-2016 Konstantin Khlebnikov <koct9i@gmail.com>

ext4: use __GFP_NOFAIL in ext4_free_blocks()

This might be unexpected, but pages allocated for sbi->s_buddy_cache are
charged to the current memory cgroup. So a GFP_NOFS allocation could fail
if the current task has been killed by the OOM killer or if the current
memory cgroup has no free memory left. The block allocator cannot handle
such failures here yet.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b8a07463 09-Mar-2016 Adam Buchbinder <adam.buchbinder@gmail.com>

ext4: fix misspellings in comments.

Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# f96c450d 21-Feb-2016 Daeho Jeong <daeho.jeong@samsung.com>

ext4: make sure to revoke all the freeable blocks in ext4_free_blocks

Currently, ext4_free_blocks() doesn't revoke the data blocks of a per-file
data-journalled inode, and this can cause file data inconsistency problems.
Even though the data blocks of a per-file data-journalled inode are already
forgotten by jbd2_journal_invalidatepage() in advance of invoking
ext4_free_blocks(), we still need to revoke the data blocks here.
Moreover, some of the metadata blocks, which are not found by
sb_find_get_block(), still need to be revoked, but this is also
missing here.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# 802cf1f9 11-Feb-2016 Huaitong Han <huaitong.han@intel.com>

ext4: add a line break for proc mb_groups display

This patch adds a line break for proc mb_groups display.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>


# 79211c8e 09-Nov-2015 Andrew Morton <akpm@linux-foundation.org>

remove abs64()

Switch everything to the new and more capable implementation of abs().
Mainly to give the new abs() a bit of a workout.

Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 16175039 18-Oct-2015 John Stultz <john.stultz@linaro.org>

ext4: fix abs() usage in ext4_mb_check_group_pa

The ext4_fsblk_t type is a long long, which should not be used
with abs(), as is done in ext4_mb_check_group_pa().

This patch modifies ext4_mb_check_group_pa() to use abs64()
instead.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 9c02ac97 17-Oct-2015 Daeho Jeong <daeho.jeong@samsung.com>

ext4: fix xfstest generic/269 double revoked buffer bug with bigalloc

When you repeatedly execute xfstest generic/269 with the bigalloc_1k option
enabled, using the command below:

"./kvm-xfstests -c bigalloc_1k -m nodelalloc -C 1000 generic/269"

you can easily see the below bug message.

"JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);"

This means that an already-revoked buffer is erroneously revoked again,
which is caused by revoking the buffer at the wrong position
in ext4_free_blocks(). We need to re-position the buffer revoke
procedure for an unspecified buffer after the cluster boundary check
for the bigalloc option. If not, some part of the cluster can be doubly
revoked.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>


# 9008a58e 17-Oct-2015 Darrick J. Wong <darrick.wong@oracle.com>

ext4: make the bitmap read routines return real error codes

Make the bitmap reading routines return real error codes (EIO,
EFSCORRUPTED, EFSBADCRC) which can then be reflected back to
userspace for more precise diagnostic work.

In particular, this means that mballoc no longer claims that we're out
of memory if the block bitmaps become corrupt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ebd173be 22-Sep-2015 Theodore Ts'o <tytso@mit.edu>

ext4: move procfs registration code to fs/ext4/sysfs.c

This allows us to refactor the procfs code, which saves a bit of
compiled space. More importantly it isolates most of the procfs
support code into a single file, so it's easier to #ifdef it out if
the proc file system has been disabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 7444a072 04-Jul-2015 Michal Hocko <mhocko@suse.cz>

ext4: replace open coded nofail allocation in ext4_free_blocks()

ext4_free_blocks() loops around the allocation request, mimicking
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open-coded loop and replace it with __GFP_NOFAIL. Without the
flag, the allocator has no way to find out about the never-fail
requirement and cannot help in any way.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
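
A hedged sketch of the resulting call shape (the cache and structure
names are illustrative):

    #include <linux/slab.h>

    struct free_data_sketch {
            int dummy;
    };

    static struct kmem_cache *free_data_cachep;

    /* Instead of looping until kmem_cache_alloc() succeeds, tell the
     * allocator up front that this request is not allowed to fail. */
    static struct free_data_sketch *alloc_free_data(void)
    {
            return kmem_cache_alloc(free_data_cachep,
                                    GFP_NOFS | __GFP_NOFAIL);
    }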


# 97b4af2f 14-Jun-2015 Rasmus Villemoes <linux@rasmusvillemoes.dk>

ext4: mballoc: avoid 20-argument function call

Making a function call with 20 arguments is rather expensive in both
stack and .text. In this case, doing the formatting manually doesn't
make it any less readable, so we might as well save 155 bytes of .text
and 112 bytes of stack.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>


# 42ac1848 08-Jun-2015 Lukas Czerner <lczerner@redhat.com>

ext4: return error code from ext4_mb_good_group()

Currently ext4_mb_good_group() only returns 0 or 1 depending on whether
the allocation group is suitable for use or not. However we might get
various errors and fail while initializing new group including -EIO
which would never get propagated up the call chain. This might lead to
an endless loop at writeback when we're trying to find a good group to
allocate from and we fail to initialize new group (read error for
example).

Fix this by returning proper error code from ext4_mb_good_group() and
using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator()
we will always return only the first occurred error from
ext4_mb_good_group() and we only propagate it back to the caller if we
do not get any other errors and we fail to allocate any blocks.

Note that with other modes than errors=continue, we will fail
immediately in ext4_mb_good_group() in case of error, however with
errors=continue we should try to continue using the file system, that's
why we're not going to fail immediately when we see an error from
ext4_mb_good_group(), but rather when we fail to find a suitable block
group to allocate from due to a problem in group initialization.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# bbdc322f 08-Jun-2015 Lukas Czerner <lczerner@redhat.com>

ext4: try to initialize all groups we can in case of failure on ppc64

Currently on the machines with page size > block size when initializing
block group buddy cache we initialize it for all the block group bitmaps
in the page. However in the case of read error, checksum error, or if
a single bitmap is in any way corrupted we would fail to initialize all
of the bitmaps. This is problematic because we will not have access to
the other allocation groups even though those might be perfectly fine
and usable.

Fix this by reading all the bitmaps instead of erroring out on the first
problem, and simply skip the bitmaps which were either not read properly
or are not valid.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 66114cad 22-May-2015 Tejun Heo <tj@kernel.org>

writeback: separate out include/linux/backing-dev-defs.h

With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bfcba2d0 25-Nov-2014 Markus Elfring <elfring@users.sourceforge.net>

ext4: Remove an unnecessary check for NULL before iput()

The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 4fdb5543 25-Nov-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: cleanup GFP flags inside resize path

We must use GFP_NOFS instead of GFP_KERNEL inside ext4_mb_add_groupinfo()
and ext4_calculate_overhead() because they are called from inside a
journal transaction. Call trace:

ioctl
->ext4_group_add
->journal_start
->ext4_setup_new_descs
->ext4_mb_add_groupinfo -> GFP_KERNEL
->ext4_flex_group_add
->ext4_update_super
->ext4_calculate_overhead -> GFP_KERNEL
->journal_stop

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b93b41d4 19-Nov-2014 Al Viro <viro@ZenIV.linux.org.uk>

ext4: kill ext4_kvfree()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# dfe076c1 01-Oct-2014 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: get rid of code duplication

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 754cfed6 04-Sep-2014 Theodore Ts'o <tytso@mit.edu>

ext4: drop the EXT4_STATE_DELALLOC_RESERVED flag

Having done a full regression test, we can now drop the
DELALLOC_RESERVED state flag.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# e3cf5d5d 04-Sep-2014 Theodore Ts'o <tytso@mit.edu>

ext4: prepare to drop EXT4_STATE_DELALLOC_RESERVED

The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented
because it was too hard to make sure the mballoc and get_block flags
could be reliably passed down through all of the codepaths that end up
calling ext4_mb_new_blocks().

Since then, we have mb_flags passed down through most of the code
paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky
as it used to be.

This commit plumbs in the last of what is required, and then adds a
WARN_ON check to make sure we haven't missed anything. If this passes
a full regression test run, we can then drop
EXT4_STATE_DELALLOC_RESERVED.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>


# a0b6bc63 16-Aug-2014 Christoph Lameter <cl@linux.com>

block: Replace __this_cpu_ptr with raw_cpu_ptr

__this_cpu_ptr is being phased out; use raw_cpu_ptr instead, which was
introduced in 3.15-rc1.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# c99d1e6e 23-Aug-2014 Theodore Ts'o <tytso@mit.edu>

ext4: fix BUG_ON in mb_free_blocks()

If we suffer a block allocation failure (for example due to a memory
allocation failure), it's possible that we will call
ext4_discard_allocated_blocks() before we've actually allocated any
blocks. In that case, fe_len and fe_start in ac->ac_f_ex will still
be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
triggering the BUG_ON on mb_free_blocks():

BUG_ON(last >= (sb->s_blocksize << 3));

Fix this by bailing out of ext4_discard_allocated_blocks() if fe_len
is zero.

Also fix a missing ext4_mb_unload_buddy() call in
ext4_discard_allocated_blocks().

Google-Bug-Id: 16844242

Fixes: 86f0afd463215fc3e58020493482faa4ac3a4d69
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
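
A rough sketch of the bail-out; the field names follow the description
above, and the rest of ext4_discard_allocated_blocks() (which needs the
fs/ext4 internal headers) is omitted:

    static void discard_allocated_blocks_sketch(struct ext4_allocation_context *ac)
    {
            /* If allocation failed before anything was marked in the
             * in-memory buddy bitmap, fe_start/fe_len are still zero and
             * there is nothing to free, so do not call mb_free_blocks(). */
            if (ac->ac_f_ex.fe_len == 0)
                    return;

            /* ... otherwise load the buddy, free the range and unload ... */
    }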


# 86f0afd4 30-Jul-2014 Theodore Ts'o <tytso@mit.edu>

ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct

If there is a failure while allocating the preallocation structure, a
number of blocks can end up getting marked in the in-memory buddy
bitmap, and then not getting released. This can result in the
following corruption getting reported by the kernel:

EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126,
12793 clusters in bitmap, 12729 in gd

In that case, we need to release the blocks using mb_free_blocks().

Tested: fs smoke test; also demonstrated that with injected errors,
the file system is no longer getting corrupted

Google-Bug-Id: 16657874

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org


# b27b1535 27-Jul-2014 Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>

ext4: fix wrong size computation in ext4_mb_normalize_request()

As the member fe_len defined in struct ext4_free_extent is expressed as a
number of clusters, the computation of the variable "size" is wrong; we need
to first translate fe_len to a block count, then to bytes.

Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>


# 71d4f7d0 15-Jul-2014 Theodore Ts'o <tytso@mit.edu>

ext4: remove metadata reservation checks

Commit 27dd43854227b ("ext4: introduce reserved space") reserves 2% of
the file system space to make sure metadata allocations will always
succeed. Given that, tracking the reservation of metadata blocks is
no longer necessary.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 94d4c066 05-Jul-2014 Theodore Ts'o <tytso@mit.edu>

ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()

We are spending a lot of time explaining to users what this error
means. Let's try to improve the message to avoid this problem.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org


# e43bb4e6 26-Jun-2014 Namjae Jeon <namjae.jeon@samsung.com>

ext4: decrement free clusters/inodes counters when block group declared bad

We should decrement the free clusters counter when the block bitmap is
marked as corrupt, and the free inodes counter when the inode allocation
bitmap is marked as corrupt, to avoid confusion due to an incorrect
available size in the statfs result. The user can then get an ENOSPC error
immediately from write begin, without waiting for writepages.

Cc: Darrick J. Wong<darrick.wong@oracle.com>
Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>


# 2457aec6 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: non-atomically mark page accessed during page cache allocation where possible

aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible, the atomic operations are necessary, which is noticeable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticeable with fast storage. The objective of the patch is to initialise
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this case are the following and are used
by most filesystems.

find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin

All of them are very similar in detail, so the patch creates a core helper,
pagecache_get_page(), which takes a flags parameter that affects its
behavior, such as whether the page should be marked accessed or not. The
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of
mark_page_accessed() has now changed, so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced whereas previously it might
have been repromoted.
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it is possible that the workload completes without even
hitting the disk and will have variable results, but it highlights the
impact of mark_page_accessed for async IO. The sync results are expected
to be more stable. The exception is tmpfs, where the normal case is for
the "IO" to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO; only wall
times are shown for async as the granularity reported by dd and the
variability are unsuitable for comparison. As async results were variable
due to writeback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5b60778 26-May-2014 Maurizio Lombardi <mlombard@redhat.com>

ext4: fix wrong assert in ext4_mb_normalize_request()

The variable "size" is expressed as a number of blocks and not as a
number of clusters; this could trigger a kernel panic when using ext4
with a cluster size different from the block size.

Cc: stable@vger.kernel.org
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5d601255 12-May-2014 liang xie <xieliang007@gmail.com>

ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access

Make them more consistent.

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 029b10c5 11-May-2014 Andrey Tsyvarev <tsyvarev@ispras.ru>

ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails

Caches from 'ext4_groupinfo_caches' may be in use by other, already
existing mounts, so it is incorrect to destroy them when a newly
requested mount fails.

Found by Linux File System Verification project (linuxtesting.org).

Signed-off-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>


# e2cbd587 12-Apr-2014 jon ernst <jonernst07@gmail.com>

ext4: silence sparse check warning for function ext4_trim_extent

This fixes the following sparse warning:

CHECK fs/ext4/mballoc.c
fs/ext4/mballoc.c:5019:9: warning: context imbalance in
'ext4_trim_extent' - unexpected unlock

Signed-off-by: "Jon Ernst" <jonernst07@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c57ab39b 10-Apr-2014 Younger Liu <younger.liucn@gmail.com>

ext4: return ENOMEM rather than EIO when find_###_page() fails

Return ENOMEM rather than EIO when find_get_page() fails in
ext4_mb_get_buddy_page_lock() and find_or_create_page() fails in
ext4_mb_load_buddy().

Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# dc9ddd98 20-Feb-2014 Eric Sandeen <sandeen@redhat.com>

ext4: remove unused ac_ex_scanned

When looking at a bug report with:

> kernel: EXT4-fs: 0 scanned, 0 found

I thought wow, 0 scanned, that's odd? But it's not odd; it's printing
a variable that is initialized to 0 and never touched again.

It's never been used since the original merge, so I don't really even
know what the original intent was, either.

If anyone knows how to hook it up, speak now via patch, otherwise just
yank it so it's not making a confusing situation more confusing in
kernel logs.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ab0c00fc 19-Feb-2014 Theodore Ts'o <tytso@mit.edu>

ext4: make sure ex.fe_logical is initialized

The lowest levels of mballoc set all of the fields of struct
ext4_free_extent except for fe_logical, since they are just trying to
find the requested free set of blocks, and the logical block hasn't
been set yet. This makes some static code checkers sad. Set it to
various different debug values, which would be useful when debugging
mballoc if these values were ever to show up due to parts of mballoc
trying to use ac->ac_b_ex.fe_logical before the upper layers of mballoc
have properly set it, usually via ext4_mb_use_best_found().

Addresses-Coverity-Id: #139697
Addresses-Coverity-Id: #139698
Addresses-Coverity-Id: #139699

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f5a44db5 20-Dec-2013 Theodore Ts'o <tytso@mit.edu>

ext4: add explicit casts when masking cluster sizes

The missing casts can cause the high 64-bits of the physical blocks to
be lost. Set up new macros which allow us to make sure the right
thing happens, even if at some point we end up supporting larger
logical block numbers.

Thanks to the Emese Revfy and the PaX security team for reporting this
issue.

Reported-by: PaX Team <pageexec@freemail.hu>
Reported-by: Emese Revfy <re.emese@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 4e8d2139 03-Dec-2013 Junho Ryu <jayr@google.com>

ext4: fix use-after-free in ext4_mb_new_blocks

ext4_mb_put_pa should hold pa->pa_lock before accessing pa->pa_count.
While ext4_mb_use_preallocated checks pa->pa_deleted first and then
increments pa->pa_count later, ext4_mb_put_pa decrements pa->pa_count
before holding pa->pa_lock and then sets pa->pa_deleted.

* Free sequence
ext4_mb_put_pa (1): atomic_dec_and_test pa->pa_count
ext4_mb_put_pa (2): lock pa->pa_lock
ext4_mb_put_pa (3): check pa->pa_deleted
ext4_mb_put_pa (4): set pa->pa_deleted=1
ext4_mb_put_pa (5): unlock pa->pa_lock
ext4_mb_put_pa (6): remove pa from a list
ext4_mb_pa_callback: free pa

* Use sequence
ext4_mb_use_preallocated (1): iterate over preallocation
ext4_mb_use_preallocated (2): lock pa->pa_lock
ext4_mb_use_preallocated (3): check pa->pa_deleted
ext4_mb_use_preallocated (4): increase pa->pa_count
ext4_mb_use_preallocated (5): unlock pa->pa_lock
ext4_mb_release_context: access pa

* Use-after-free sequence
[initial status] <pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (1): iterate over preallocation
ext4_mb_use_preallocated (2): lock pa->pa_lock
ext4_mb_use_preallocated (3): check pa->pa_deleted
ext4_mb_put_pa (1): atomic_dec_and_test pa->pa_count
[pa_count decremented] <pa->pa_deleted = 0, pa_count = 0>
ext4_mb_use_preallocated (4): increase pa->pa_count
[pa_count incremented] <pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (5): unlock pa->pa_lock
ext4_mb_put_pa (2): lock pa->pa_lock
ext4_mb_put_pa (3): check pa->pa_deleted
ext4_mb_put_pa (4): set pa->pa_deleted=1
[race condition!] <pa->pa_deleted = 1, pa_count = 1>
ext4_mb_put_pa (5): unlock pa->pa_lock
ext4_mb_put_pa (6): remove pa from a list
ext4_mb_pa_callback: free pa
ext4_mb_release_context: access pa

AddressSanitizer has detected use-after-free in ext4_mb_new_blocks
Bug report: http://goo.gl/rG1On3

Signed-off-by: Junho Ryu <jayr@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
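
As a rough user-space sketch of the ordering this fix establishes (a
pthread mutex stands in for pa->pa_lock and a plain int for the atomic
pa_count; illustrative only, not the kernel code), the point is that the
reference is dropped and pa_deleted decided under the same lock that the
lookup uses to check pa_deleted and take a reference:

#include <pthread.h>
#include <stdio.h>

struct pa {
        pthread_mutex_t lock;    /* stands in for pa->pa_lock    */
        int             count;   /* stands in for pa->pa_count   */
        int             deleted; /* stands in for pa->pa_deleted */
};

/* "use" side: take a reference only under the lock and only if the
 * preallocation has not been marked deleted. */
static int pa_try_use(struct pa *pa)
{
        int ok = 0;

        pthread_mutex_lock(&pa->lock);
        if (!pa->deleted) {
                pa->count++;
                ok = 1;
        }
        pthread_mutex_unlock(&pa->lock);
        return ok;
}

/* "put" side: drop the reference and decide about deletion under the
 * same lock, so the decrement cannot interleave with pa_try_use(). */
static void pa_put(struct pa *pa)
{
        int free_it = 0;

        pthread_mutex_lock(&pa->lock);
        if (--pa->count == 0 && !pa->deleted) {
                pa->deleted = 1;
                free_it = 1;
        }
        pthread_mutex_unlock(&pa->lock);

        if (free_it)
                printf("last reference dropped, safe to free\n");
}

int main(void)
{
        struct pa pa = { PTHREAD_MUTEX_INITIALIZER, 1, 0 };

        if (pa_try_use(&pa))
                pa_put(&pa);    /* drop the reference taken above */
        pa_put(&pa);            /* drop the initial reference */
        return 0;
}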


# 8f9ff189 30-Oct-2013 Lukas Czerner <lczerner@redhat.com>

ext4: fix FITRIM in no journal mode

When using the FITRIM ioctl on a file system without a journal, it will
only trim the block group once, no matter how many times you invoke the
FITRIM ioctl and how many blocks you release from the block group.

This is because we only clear EXT4_GROUP_INFO_WAS_TRIMMED_BIT in the
journal callback. Fix this by clearing the bit in no journal mode as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Jorge Fábregas <jorge.fabregas@gmail.com>


# 163a203d 28-Aug-2013 Darrick J. Wong <darrick.wong@oracle.com>

ext4: mark block group as corrupt on block bitmap error

When we notice a block-bitmap corruption (because of device failure or
something else), we should mark this group as corrupt and prevent
further block allocations/deallocations from it. Currently, we end up
generating one error message for every block in the bitmap. This
could potentially make the system unstable, as noticed in some
bugs. With this patch, the error is printed only the first time and
the entire block group is marked as corrupted. This prevents future
allocations/deallocations from it.

Also tested by corrupting the block
bitmap and forcefully introducing the mb_free_blocks error:
(1) create a largefile (2Gb)
$ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
(2) umount filesystem. use dumpe2fs to see which block-bitmaps
are in use by largefile and note their block numbers
(3) use dd to zero-out the used block bitmaps
$ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
(4) mount the FS and delete the largefile.
(5) recreate the largefile. verify that the new largefile does not
get any blocks from the groups marked as bad.
Without the patch, we will see mb_free_blocks error for each bit in
each zero'ed out bitmap at (4). With the patch, we only see the error
once per blockgroup:
[ 309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.

Google-Bug-Id: 7258357

[darrick.wong@oracle.com]
Further modifications (by Darrick) to make it more obvious that this corruption
bit applies to blocks only. Set the corruption flag if the block group bitmap
verification fails.

Original-author: Aditya Kali <adityakali@google.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7d734532 17-Aug-2013 Jan Kara <jack@suse.cz>

ext4: fix warning in ext4_da_update_reserve_space()

reaim workfile.dbase test easily triggers warning in
ext4_da_update_reserve_space():

EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365:
ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1
blocks with reserved 9 data blocks)

The problem is that (one of) tests creates file and then randomly writes
to it with O_SYNC. That results in writing back pages of the file in
random order so we create extents for written blocks say 0, 2, 4, 6, 8
- this last allocation also allocates new block for extents. Then we
writeout block 1 so we have extents 0-2, 4, 6, 8 and we release
indirect extent block because extents fit in the inode again. Then we
writeout block 10 and we need to allocate indirect extent block again
which triggers the warning because we don't have the reservation
anymore.

Fix the problem by giving back freed metadata blocks resulting from
extent merging into inode's reservation pool.

Signed-off-by: Jan Kara <jack@suse.cz>


# e7676a70 12-Jul-2013 Theodore Ts'o <tytso@mit.edu>

ext4: don't allow ext4_free_blocks() to fail due to ENOMEM

The filesystem should not be marked inconsistent if ext4_free_blocks()
is not able to allocate memory. Unfortunately some callers (most
notably ext4_truncate) don't have a way to reflect an error back up to
the VFS. And even if we did, most userspace applications won't deal
with most system calls returning ENOMEM anyway.

Reported-by: Nagachandra P <nagachandra@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org


# 2c00ef3e 01-Jul-2013 Alexey Khoroshilov <khoroshilov@ispras.ru>

ext4: implement error handling of ext4_mb_new_preallocation()

If memory allocation in ext4_mb_new_group_pa() fails, it returns an
error code and ext4_mb_new_preallocation() propagates it, but
ext4_mb_new_blocks() ignores it.

An observed result was:

- allocation fail means ext4_mb_new_group_pa() does not update
ext4_allocation_context;

- ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
of number of blocks requested (1);

- that activates update cycle in ext4_splice_branch():
for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
*(where->p + i) = cpu_to_le32(current_block++);

- it iterates 511 times and corrupts a chunk of memory including inode
structure;

- page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();

- system hangs with 'scheduling while atomic' BUG.

The patch implements a check for ext4_mb_new_preallocation() error
code and handles its failure as if ext4_mb_regular_allocator() fails.

Found by Linux File System Verification project (linuxtesting.org).

[ Patch restructed by tytso to make the flow of control easier to follow. ]

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2ed5724d 12-Jun-2013 Theodore Ts'o <tytso@mit.edu>

ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator()

For a file system with a very large number of block groups, if all of
the block group bitmaps are in memory and the file system is
relatively badly fragmented, it's possible for ext4_mb_regular_allocator()
to take a long time trying to find a good match. This is especially
true if the tuning parameter mb_max_to_scan has been set to a very
large number. So add a cond_resched() to avoid soft lockup warnings
and to provide better system responsiveness.

For ext4_free_blocks(), if we are deleting a large range of blocks,
and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
the loop to call sb_find_get_block() and to call ext4_forget() can
take over 10-15 milliseconds or more. So it's better to add a
cond_resched() here as well.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
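
The general pattern is simply a voluntary reschedule inside a potentially
long loop. A user-space analogue (sched_yield() standing in for the
kernel's cond_resched(); the loop body is hypothetical):

#include <sched.h>
#include <stdio.h>

/* Long scan over many items; yield the CPU periodically so one walk
 * cannot monopolize it, analogous to cond_resched() in a kernel loop. */
static long scan_items(long nr_items)
{
        long i, found = 0;

        for (i = 0; i < nr_items; i++) {
                /* ... per-item work would go here ... */
                if ((i & 0xffff) == 0)
                        sched_yield();  /* give other tasks a chance */
                found++;
        }
        return found;
}

int main(void)
{
        printf("scanned %ld items\n", scan_items(1L << 20));
        return 0;
}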


# e6155736 05-May-2013 Lachlan McIlroy <lmcilroy@redhat.com>

ext4: limit group search loop for non-extent files

In the case where we are allocating for a non-extent file,
we must limit the groups we allocate from to those below
2^32 blocks, and ext4_mb_regular_allocator() attempts to
do this initially by putting a cap on ngroups for the
subsequent search loop.

However, the initial target group comes in from the
allocation context (ac), and it may already be beyond
the artificially limited ngroups. In this case,
the limit

if (group == ngroups)
group = 0;

at the top of the loop is never true, and the loop will
run away.

Catch this case inside the loop and reset the search to
start at group 0.

[sandeen@redhat.com: add commit msg & comments]

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
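
A minimal sketch of the loop shape after the fix (illustrative only; the
">=" clamp inside the loop is the point, the actual allocation work is
omitted):

#include <stdio.h>

/* The starting group comes from a hint and may already be >= ngroups,
 * so it is clamped inside the loop rather than relying on an exact
 * "== ngroups" wrap test that a too-large hint would never hit. */
static int search_groups(unsigned int start, unsigned int ngroups)
{
        unsigned int i, group = start;

        for (i = 0; i < ngroups; i++, group++) {
                if (group >= ngroups)   /* catches a hint beyond the cap */
                        group = 0;
                /* ... try to allocate from 'group' here ... */
        }
        return -1;  /* nothing found after one full pass */
}

int main(void)
{
        /* start=100 with only 16 usable groups still terminates */
        printf("%d\n", search_groups(100, 16));
        return 0;
}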


# 8c8e0ca6 09-Apr-2013 Dmitri Monakho <dmonakhov@openvz.org>

ext4: fix useless declarations

This patch should fix sparse complaints about shadow declarations.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# d9dda78b 31-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

procfs: new helper - PDE_DATA(inode)

The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data. Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# eabe0444 08-Apr-2013 Andrey Sidorov <qrxd43@motorola.com>

ext4: speed-up releasing blocks on commit

Improve mb_free_blocks speed by clearing the entire range at once instead
of iterating over each bit. Freeing block-by-block also makes the buddy
bitmap subtree flip twice, making most of the work a no-op. Very few
bits in the buddy bitmap require change, e.g. freeing an entire group is
a 1-bit flip only. As a result, releasing the blocks of a 60G file now
takes 5ms instead of 2.7s. This is especially good for non-preemptive
kernels as there is no rescheduling during the release.

Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
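
The idea can be illustrated with a simplified byte-wise bitmap helper
(assumed layout: bit n lives in byte n/8; the real buddy-bitmap code is
more involved):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Clear 'len' bits starting at 'start' in one pass: whole bytes are
 * cleared with a single memset(), only the partial bytes at the two
 * edges are looped bit by bit. */
static void clear_bit_range(uint8_t *bitmap, unsigned int start,
                            unsigned int len)
{
        while (len && (start & 7)) {            /* leading partial byte */
                bitmap[start >> 3] &= ~(1u << (start & 7));
                start++;
                len--;
        }
        if (len >= 8) {                         /* whole bytes at once */
                memset(bitmap + (start >> 3), 0, len >> 3);
                start += len & ~7u;
                len &= 7;
        }
        while (len--) {                         /* trailing partial byte */
                bitmap[start >> 3] &= ~(1u << (start & 7));
                start++;
        }
}

int main(void)
{
        uint8_t bm[16];

        memset(bm, 0xff, sizeof(bm));
        clear_bit_range(bm, 5, 100);            /* clears bits 5..104 */
        printf("first bytes: %02x %02x %02x\n", bm[0], bm[1], bm[2]);
        return 0;
}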


# bd86298e 03-Apr-2013 Lukas Czerner <lczerner@redhat.com>

ext4: introduce ext4_get_group_number()

Currently in many places in ext4 we're using
ext4_get_group_no_and_offset() even though we're only interested in
knowing the block group of the particular block, not the offset within
the block group, so we can use a more efficient way to compute the block
group.

This patch introduces ext4_get_group_number() which computes block
group for a given block much more efficiently. Use this function
instead of ext4_get_group_no_and_offset() everywhere where we're only
interested in knowing the block group.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5d3ee208 03-Apr-2013 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: fix journal callback list traversal

It is incorrect to use list_for_each_entry_safe() for journal callback
traversal because ->next may be removed by another task:
->ext4_mb_free_metadata()
->ext4_mb_free_metadata()
->ext4_journal_callback_del()

This results in the following issue:

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107
Call Trace:
[<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
[<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
[<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
[<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
[<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
[<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
[<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
[<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
[<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
[<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
[<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
[<ffffffff810ac6be>] kthread+0x10e/0x120
[<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
[<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
[<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

This patch fixes the issue as follows:
- make the list traversal in ext4_journal_commit_callback() truly safe,
simply by always restarting from the list head
- fix the race between ext4_journal_callback_del() and
ext4_journal_callback_try_del()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.com


# b10a44c3 03-Apr-2013 Theodore Ts'o <tytso@mit.edu>

ext4: add might_sleep() annotations

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>


# 90ba983f 11-Mar-2013 Theodore Ts'o <tytso@mit.edu>

ext4: use atomic64_t for the per-flexbg free_clusters count

A user who was using an 8TB+ file system with a very large flexbg
size (> 65536) could cause the atomic_t used in struct flex_groups
to overflow. This was detected by the PaX security patchset:

http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551

This bug was introduced in commit 9f24e4208f7e, so it's been around
since 2.6.30. :-(

Fix this by using an atomic64_t for struct orlav_stats's
free_clusters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
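
A small demonstration of why the 32-bit counter is the problem (the
per-flex-group cluster counts below are made up; C11 atomics stand in for
the kernel's atomic_t/atomic64_t):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        _Atomic int32_t free32 = 0; /* like the old atomic_t counter   */
        _Atomic int64_t free64 = 0; /* like the new atomic64_t counter */
        int i;

        /* three additions of 2^30 clusters push the total past 2^31 */
        for (i = 0; i < 3; i++) {
                atomic_fetch_add(&free32, 1024 * 1024 * 1024);
                atomic_fetch_add(&free64, 1024 * 1024 * 1024);
        }
        printf("32-bit counter: %d\n", (int)atomic_load(&free32));
        printf("64-bit counter: %lld\n", (long long)atomic_load(&free64));
        return 0;
}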


# bb8b20ed 10-Mar-2013 Lukas Czerner <lczerner@redhat.com>

ext4: do not use yield()

Using yield() is strongly discouraged (see sched/core.c) especially
since we can just use cond_resched().

Replace all use of yield() with cond_resched().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e3d85c36 10-Mar-2013 Lukas Czerner <lczerner@redhat.com>

ext4: remove unused variable in ext4_free_blocks()

Remove unused variable 'freed' in ext4_free_blocks().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 810da240 02-Mar-2013 Lukas Czerner <lczerner@redhat.com>

ext4: convert number of blocks to clusters properly

We're using the macro EXT4_B2C() to convert a number of blocks to a
number of clusters for bigalloc file systems. However, we should be
using EXT4_NUM_B2C().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
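
The difference between the two conversions is truncation versus rounding
up. A sketch with an assumed ratio of 16 blocks per cluster (the helper
names below are local stand-ins, not the kernel macros):

#include <stdio.h>

#define CLUSTER_BITS 4
#define CLUSTER_SIZE (1u << CLUSTER_BITS)

/* b2c(): truncating conversion, meant for block *numbers*. */
static unsigned long b2c(unsigned long blk)
{
        return blk >> CLUSTER_BITS;
}

/* num_b2c(): round-up conversion, meant for block *counts*. */
static unsigned long num_b2c(unsigned long nr_blocks)
{
        return (nr_blocks + CLUSTER_SIZE - 1) >> CLUSTER_BITS;
}

int main(void)
{
        /* 17 blocks span 2 clusters, but truncation would claim just 1 */
        printf("b2c(17)     = %lu\n", b2c(17));
        printf("num_b2c(17) = %lu\n", num_b2c(17));
        return 0;
}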


# a0b30c12 09-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: use module parameters instead of debugfs for mballoc_debug

There are multiple reasons to move away from debugfs. First of all,
we are only using it for a single parameter, it is much more
complicated to set up (some 30 lines of code compared to 3), and it is
one more thing that might fail while loading the ext4 module.

Secondly, as a module parameter it can be specified as a boot option if
ext4 is built into the kernel, or as a parameter when the module is
loaded, and it can also be manipulated dynamically under
/sys/module/ext4/parameters/mballoc_debug. So it is more flexible.

Ultimately we want to move away from using mb_debug() towards
tracepoints, but for now this is still a useful simplification of the
code base.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 40ae3487 04-Feb-2013 Theodore Ts'o <tytso@mit.edu>

ext4: optimize mballoc for large allocations

The ext4 block allocator only maintains buddy bitmaps for chunks which
are less than or equal to one quarter of a block group. That is, for
a file system with a 1k blocksize, where the number of blocks in a
block group is 8192 blocks, the largest chunk size tracked by buddy
bitmaps is 2048 blocks.

For a file system with a 4k blocksize, where the number of blocks
in a block group is 32768 blocks, the largest chunk size tracked by
buddy bitmaps is 8192 blocks.

To work around this, mballoc.c before this commit would truncate
allocation requests to the number of blocks in a block group minus 10.
Why 10? Aside from being a completely arbitrary number, it keeps the
allocation request from being a power of two larger than 25% of the
block group. If you try to explicitly fallocate 50% of the block group
size, this will demonstrate the problem; the block allocation code
will scan all of the blocks in the file system with cr==0 (since
the request is for a natural power of two), but then completely fail
for all block groups, since the buddy bitmaps don't track chunk sizes
of 50% of the block group.

To fix this, in these cases we use ext4_mb_complex_scan_group() instead of
ext4_mb_simple_scan_group().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>
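
The numbers in the text follow directly from the "quarter of a block
group" limit; a tiny check (group sizes taken from the examples above):

#include <stdio.h>

int main(void)
{
        unsigned int blocks_per_group_1k = 8192;   /* 1k blocksize example */
        unsigned int blocks_per_group_4k = 32768;  /* 4k blocksize example */

        printf("1k fs: largest buddy chunk = %u blocks\n",
               blocks_per_group_1k / 4);           /* 2048 */
        printf("4k fs: largest buddy chunk = %u blocks\n",
               blocks_per_group_4k / 4);           /* 8192 */
        return 0;
}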


# f1167009 01-Feb-2013 Niu Yawei <yawei.niu@gmail.com>

ext4: fix race in ext4_mb_add_n_trim()

In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when
changing the lg_prealloc_list.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
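
A user-space analogue of the fix (a pthread mutex stands in for
lg_prealloc_lock and a singly linked list for lg_prealloc_list; the point
is only that the shared list is modified with the lock held):

#include <pthread.h>
#include <stdio.h>

struct node {
        struct node *next;
        int          val;
};

static struct node    *lg_list;                             /* shared list */
static pthread_mutex_t lg_lock = PTHREAD_MUTEX_INITIALIZER; /* its lock    */

/* Every modification of the shared list happens under the lock, so two
 * tasks adding or trimming concurrently cannot corrupt the links. */
static void lg_add(struct node *n)
{
        pthread_mutex_lock(&lg_lock);
        n->next = lg_list;
        lg_list = n;
        pthread_mutex_unlock(&lg_lock);
}

int main(void)
{
        struct node a = { .val = 1 }, b = { .val = 2 };

        lg_add(&a);
        lg_add(&b);
        printf("head: %d\n", lg_list->val);
        return 0;
}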


# d71c1ae2 08-Nov-2012 Lukas Czerner <lczerner@redhat.com>

ext4: warn when discard request fails other than EOPNOTSUPP

We should warn the user when the discard request fails. However we need to
exclude the -EOPNOTSUPP case, since parts of the device might not support
it while other parts can. So print the kernel warning when an error other
than -EOPNOTSUPP is returned from ext4_issue_discard().

We should also handle error cases in batched discard, again excluding
EOPNOTSUPP.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# d8ec0c39 07-Nov-2012 Alan Cox <alan@linux.intel.com>

ext4: remove unused assignment

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6d138ced 08-Nov-2012 Eric Sandeen <sandeen@redhat.com>

ext4: fix awful goto in ext4_mb_new_blocks()

I think the whole function could be made prettier, but
that goto really took the cake for too-clever-by-half.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5de35e8d 22-Oct-2012 Lukas Czerner <lczerner@redhat.com>

ext4: Avoid underflow in ext4_trim_fs()

Currently, if the len argument in ext4_trim_fs() is smaller than one block,
the 'end' variable underflows. Avoid that by returning EINVAL if len is
smaller than a file system block.

Also remove a useless unlikely().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
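
A sketch of the guard (block size assumed to be 4096 bytes; the
byte-to-block math is simplified compared to the real ext4_trim_fs()):

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4096u  /* assumed file system block size */

/* If len is smaller than one block, deriving an inclusive end block from
 * it would underflow, so reject such a request up front with EINVAL. */
static int trim_fs(uint64_t start, uint64_t len)
{
        uint64_t first_blk, last_blk;

        if (len < BLOCK_SIZE)
                return -EINVAL;

        first_blk = start / BLOCK_SIZE;
        last_blk  = (start + len - 1) / BLOCK_SIZE;
        printf("trimming blocks %llu..%llu\n",
               (unsigned long long)first_blk, (unsigned long long)last_blk);
        return 0;
}

int main(void)
{
        printf("short request: %d\n", trim_fs(0, 512));      /* -EINVAL */
        printf("valid request: %d\n", trim_fs(0, 1u << 20)); /* ok      */
        return 0;
}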


# 79f1ba49 21-Oct-2012 Tao Ma <boyu.mt@taobao.com>

ext4: Checksum the block bitmap properly with bigalloc enabled

In mke2fs, we only checksum the whole bitmap block, and that is right.
But in the kernel, we use EXT4_BLOCKS_PER_GROUP to indicate the
size of the checksummed bitmap, which is wrong when we enable bigalloc.
The right size should be EXT4_CLUSTERS_PER_GROUP, and this patch fixes
it.

Also, as every caller of ext4_block_bitmap_csum_set and
ext4_block_bitmap_csum_verify passes in EXT4_BLOCKS_PER_GROUP(sb)/8,
we'd better remove this parameter and set it in the function itself.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org


# aaf7d73e 26-Sep-2012 Lukas Czerner <lczerner@redhat.com>

ext4: enable FITRIM ioctl on bigalloc file system

With minor tweaks regarding the minimum extent size to discard and
discarded bytes reporting, FITRIM can be enabled on a bigalloc file
system and works without any problem.

This patch fixes minlen handling and discarded bytes reporting to
take bigalloc-enabled file systems into consideration, and finally
removes the restriction and allows FITRIM to be used on file systems
with the bigalloc feature enabled.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 85556c9a 26-Sep-2012 Wei Yongjun <yongjun_wei@trendmicro.com.cn>

ext4: use kmem_cache_zalloc instead of kmem_cache_alloc/memset

Use kmem_cache_zalloc() instead of kmem_cache_alloc() and memset().

spatch with a semantic match was used to find this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 838cd0cf 23-Sep-2012 Yongqiang Yang <xiaoqiangnk@gmail.com>

ext4: check free block counters in ext4_mb_find_by_goal

Free block counters should be checked before doing allocation.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b5e2368b 18-Sep-2012 Theodore Ts'o <tytso@mit.edu>

ext4: re-enable -o discard functionality in no-journal mode

This is a revert of commit b56ff9d397ce, which removed the call to
ext4_issue_discard() to fix a BUG reported because
ext4_issue_discard() was being called from inside a block group
spinlock. As it turns out this bug had already been fixed by Lukas
Czerner in commit 53fdcf992d61 by the simple expedient of moving when
we call ext4_issue_discard() outside the spinlock.

So it should be safe to re-enable this functionality, which I tested
by putting a BUG_ON(in_atomic) just after the restored callsite to
ext4_issue_discard().

Addresses-Google-Bug: #6750518

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Anatol Pomozov <anatol.pomozov@gmail.com>


# 28623c2f 04-Sep-2012 Theodore Ts'o <tytso@mit.edu>

ext4: grow the s_group_info array as needed

Previously we allocated the s_group_info array with enough space for
any future possible growth of the file system via online resize. This
is unfortunate because it wastes memory, and it doesn't work for the
meta_bg scheme, since there is no limit based on the number of
reserved gdt blocks. So add the code to grow the s_group_info array
as needed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4907cb7b 01-Sep-2012 Anatol Pomozov <anatol.pomozov@gmail.com>

treewide: fix comment/printk/variable typos

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 15c006a2 17-Aug-2012 Robin Dong <sanbai@taobao.com>

ext4: remove unused function argument 'order' in mb_find_extent()

All the routines that call mb_find_extent() set the argument 'order' to 0,
just like:

mb_find_extent(e4b, 0, ex.fe_start, ex.fe_len, &ex);

therefore the useless argument should be removed.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 01fc48e8 17-Aug-2012 Theodore Ts'o <tytso@mit.edu>

ext4: don't load the block bitmap for block groups which have no space

Add a short circuit check to ext4_mb_good_group() so that we don't
bother to load the block bitmap for a block group which does not have
any space available. (Or which does not have enough space until we
are in desperation mode, i.e., when cr == 3.)

Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741
Reported-by: mirek@me.com
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 97a74068 22-Jul-2012 Jan Kara <jack@suse.cz>

ext4: remove useless marking of superblock dirty

Commit a0375156 properly notes that the superblock doesn't need to be marked
as dirty when only the number of free inodes / blocks / number of directories
changes, since that is recomputed on each mount anyway. However that commit
leaves some unnecessary markings as dirty in place. Remove these.

Artem: tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>


# 62a1391d 09-Jul-2012 Haibo Liu <HaiboLiu6@gmai.com>

ext4: remove an unused statement in ext4_mb_get_buddy_page_lock()

In this patch, the statement "poff = block % blocks_per_page"
in ext4_mb_get_buddy_page_lock has no effect.

It will be optimized out by the compiler, but it's better to remove it.

Signed-off-by: Haibo Liu <HaiboLiu6@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1c8457ca 30-Jun-2012 Aditya Kali <adityakali@google.com>

ext4: avoid uneeded calls to ext4_mb_load_buddy() while reading mb_groups

Currently ext4_mb_load_buddy is called for every group, irrespective
of whether the group info is already in memory, while reading the
/proc/fs/ext4/<partition>/mb_groups proc file. For the purpose of the
mb_groups proc file, it is unnecessary to load the group info
from disk if it was loaded in the past. These calls to ext4_mb_load_buddy
make reading the mb_groups proc file expensive.

Also, the locks around ext4_get_group_info are not required.

This patch modifies the code to call ext4_mb_load_buddy only if the
group info has never been loaded into memory in the past. It also removes
the mb group locking around the ext4_get_group_info call.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 95599968 31-May-2012 Salman Qazi <sqazi@google.com>

ext4: remove mb_groups before tearing down the buddy_cache

We can't have references held on pages in the s_buddy_cache while we are
trying to truncate its pages and put the inode. All the pages must be
gone before we reach clear_inode. This can only be guaranteed if we
can prevent new users from grabbing references to s_buddy_cache's pages.

The original bug can be reproduced and the bug fix can be verified by:

while true; do mount -t ext4 /dev/ram0 /export/hda3/ram0; \
umount /export/hda3/ram0; done &

while true; do cat /proc/fs/ext4/ram0/mb_groups; done

Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 02b78310 31-May-2012 Salman Qazi <sqazi@google.com>

ext4: add ext4_mb_unload_buddy in the error path

ext4_free_blocks fails to pair an ext4_mb_load_buddy with a matching
ext4_mb_unload_buddy when it fails a memory allocation.

Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 400db9d3 28-May-2012 Zheng Liu <gnehzuil.liu@gmail.com>

ext4: cleanup in ext4_discard_allocated_blocks()

Remove the 'len' variable in ext4_discard_allocated_blocks() because it
is useless.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9d99012f 28-May-2012 Akira Fujita <a-fujita@rs.jp.nec.com>

ext4: remove needs_recovery in ext4_mb_init()

needs_recovery in ext4_mb_init() is not used; remove it.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.ne.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# feb0ab32 29-Apr-2012 Darrick J. Wong <djwong@us.ibm.com>

ext4: make block group checksums use metadata_csum algorithm

metadata_csum supersedes uninit_bg. Convert the ROCOMPAT uninit_bg
flag check to a helper function that covers both, and make the
checksum calculation algorithm use either crc16 or the metadata_csum
chosen algorithm depending on which flag is set. Print a warning if
we try to mount a filesystem with both feature flags set.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# fa77dcfa 29-Apr-2012 Darrick J. Wong <djwong@us.ibm.com>

ext4: calculate and verify block bitmap checksum

Compute and verify the checksum of the block bitmap; this checksum is
stored in the block group descriptor.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a7967f05 21-Mar-2012 Lukas Czerner <lczerner@redhat.com>

ext4: always set the trimmed blocks count into len

Currently, if the range to trim is too small (for example, on a 1K fs a
request to trim only the first block), then 'range->len' is not set,
reporting a wrong number of discarded blocks to the caller.

Fix this by always setting 'range->len' before we return. Note that
when there is a failure (-EINVAL), the caller cannot depend on
'range->len' being set properly.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 21e7fd22 21-Mar-2012 Lukas Czerner <lczerner@redhat.com>

ext4: fix trimmed block count accounting

Currently when there is not enough free blocks in the block group to
discard (grp->bb_free < minlen) the 'trimmed' is bumped up anyway with
the number of discarded blocks from the previous iteration. Fix this
by bumping up 'trimmed' only if the ext4_trim_all_free() was actually
run.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 913eed83 21-Mar-2012 Lukas Czerner <lczerner@redhat.com>

ext4: fix start and len arguments handling in ext4_trim_fs()

The overflow can happen when we are calling get_group_no_and_offset(),
which stores the group number in the ext4_grpblk_t type, which is
actually an int. However, when the blocknr is big enough, the group number
might be bigger than what ext4_grpblk_t can hold, resulting in overflow.
This will most likely happen with the FITRIM default argument len = ULLONG_MAX.

Fix this by using an "end" variable instead of "start+len", as it is easier
to get right, and specifically check that the end is not beyond the end
of the file system, so we are sure that the result of
get_group_no_and_offset() will not overflow. Otherwise truncate it to
the size of the file system.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
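
A sketch of the "end"-based range handling described above (file system
size and request values are made up; the real code also converts bytes to
blocks first):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t fs_blocks = 1000000;      /* assumed fs size in blocks */
        uint64_t start = 12345;
        uint64_t len = UINT64_MAX;         /* FITRIM default length     */
        uint64_t end;

        /* Compute an inclusive end and clamp it to the last block before
         * any group number is derived, so a huge len cannot overflow. */
        end = start + len;
        if (end < start || end > fs_blocks - 1)
                end = fs_blocks - 1;

        printf("trim range: %llu..%llu\n",
               (unsigned long long)start, (unsigned long long)end);
        return 0;
}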


# 1084f252 19-Mar-2012 Theodore Ts'o <tytso@mit.edu>

ext4: remove trailing newlines from ext4_msg() and ext4_error() messages

The functions ext4_msg() and ext4_error() already tack on a trailing
newline, so remove the unnecessary extra newline.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7f6a11e7 19-Mar-2012 Joe Perches <joe@perches.com>

ext4: remove redundant "EXT4-fs: " from uses of ext4_msg

ext4_msg adds "EXT4-fs: " to the message output.
Remove the redundant bits from uses.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c5e8f3f3 20-Feb-2012 Theodore Ts'o <tytso@mit.edu>

ext4: remove EXT4_MB_{BITMAP,BUDDY} macros

The EXT4_MB_BITMAP and EXT4_MB_BUDDY macros obfuscate more than they
provide any abstraction. So remove them.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1592d2c5 20-Feb-2012 Curt Wohlgemuth <curtw@google.com>

ext4: mark possibly unused variable in ext4_mb_normalize_request()

The 'orig_size' local variable is only used in a call to
mb_debug(). Mark it with '__maybe_unused'.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 18aadd47 20-Feb-2012 Bobi Jam <bobijam@whamcloud.com>

ext4: expand commit callback and

The per-commit callback was used by mballoc code to manage free space
bitmaps after deleted blocks have been released. This patch expands
it to support multiple different callbacks, to allow other things to
be done after the commit has been completed.

Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 813e5727 20-Feb-2012 Theodore Ts'o <tytso@mit.edu>

ext4: fix race when setting bitmap_uptodate flag

In ext4_read_{inode,block}_bitmap() we were setting bitmap_uptodate()
before submitting the buffer for read. This is bad, since we check
bitmap_uptodate() without locking the buffer, and so if another
process is racing with us, it's possible that they will think the
bitmap is uptodate even though the read has not completed yet,
resulting in inodes and blocks potentially getting allocated more than
once if we get really unlucky.

Addresses-Google-Bug: 2828254

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 60e07cf5 18-Dec-2011 Yongqiang Yang <xiaoqiangnk@gmail.com>

ext4: do not reference pa_inode from group_pa

pa_inode in group_pa is set to NULL in ext4_mb_new_group_pa, so
pa_inode should not be referenced.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0a10da73 26-Oct-2011 Robin Dong <sanbai@taobao.com>

ext4: fix a wrong comment in __mb_check_buddy()

The comment says the bit should be 0, but the code after it asserts the
bit to be 1. This is confusing, so fix it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b051d8dc 26-Oct-2011 Robin Dong <sanbai@taobao.com>

ext4: remove unused variable in mb_find_extent()

The variable 'ord' in function mb_find_extent() is redundant, so
remove it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 66a83cde 26-Oct-2011 Robin Dong <sanbai@taobao.com>

ext4: remove unused variable in ext4_mb_generate_from_pa()

The variable 'count' in function ext4_mb_generate_from_pa() looks
useless, so remove it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ebbe0277 26-Oct-2011 Robin Dong <sanbai@taobao.com>

ext4: use stream-alloc when mb_group_prealloc set to zero

The kernel will crash on

ext4_mb_mark_diskspace_used:
BUG_ON(ac->ac_b_ex.fe_len <= 0);

after we set /sys/fs/ext4/sda/mb_group_prealloc to zero and create new
files in an ext4 filesystem.

The reason is that ac_b_ex.fe_len is also set to zero (mb_group_prealloc)
in ext4_mb_normalize_group_request because ac_flags contains
EXT4_MB_HINT_GROUP_ALLOC.

When someone sets mb_group_prealloc to zero, it means DO NOT USE GROUP
PREALLOCATION, so we should set the allocation strategy to STREAM in
this case.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 45dc63e7 20-Oct-2011 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: Allow quota file use root reservation

The quota file is filesystem metadata, so it is reasonable to permit it
to use the root reservation if necessary. This patch fixes the failure
of the 265th xfstest.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7aa0baea 06-Oct-2011 Tao Ma <boyu.mt@taobao.com>

ext4: Free resources in ext4_mb_init()'s error paths

In commit 79a77c5ac, we moved ext4_mb_init_backend after the allocation
of s_locality_group to avoid a memory leak in the error path, but there are
still some other error paths in ext4_mb_init that need to do the same
work. So this patch adds the missing error handling to ext4_mb_init, and all
the pointers are reset to NULL in case the caller might double free them.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e7d5f315 09-Sep-2011 Theodore Ts'o <tytso@mit.edu>

ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters()

This function really claims a number of free clusters, not blocks, so
rename it so it's clearer what's going on.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# cff1dfd7 09-Sep-2011 Theodore Ts'o <tytso@mit.edu>

ext4: rename ext4_free_blocks_after_init() to ext4_free_clusters_after_init()

This function really returns the number of clusters after an
uninitialized block bitmap has been initialized.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 021b65bb 09-Sep-2011 Theodore Ts'o <tytso@mit.edu>

ext4: Rename ext4_free_blks_{count,set}() to refer to clusters

The field bg_free_blocks_count_{lo,hi} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc file systems. So rename the functions to make it easier to
read and audit the block allocation and block freeing code.

Note: at this point in bigalloc development we don't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7b415bf6 09-Sep-2011 Aditya Kali <adityakali@google.com>

ext4: Fix bigalloc quota accounting and i_blocks value

With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 27baebb8 09-Sep-2011 Theodore Ts'o <tytso@mit.edu>

ext4: tune mballoc's default group prealloc size for bigalloc file systems

The default group preallocation size had been previously set to 512
blocks/clusters, regardless of the block/cluster size. This is
probably too big for large cluster sizes. So adjust the default so
that it is 2 megabytes or 32 clusters, whichever is larger.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
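
The adjusted default boils down to a simple max() over two quantities. A
sketch (cluster sizes are examples; the helper name is made up):

#include <stdio.h>

/* Default group preallocation in clusters: 2 MiB worth of clusters, but
 * never fewer than 32 clusters, whichever is larger. */
static unsigned int default_prealloc_clusters(unsigned int cluster_bytes)
{
        unsigned int two_mb = (2u << 20) / cluster_bytes;

        return two_mb > 32 ? two_mb : 32;
}

int main(void)
{
        printf("4k clusters: %u\n", default_prealloc_clusters(4096));    /* 512 */
        printf("1M clusters: %u\n", default_prealloc_clusters(1 << 20)); /* 32  */
        return 0;
}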


# 24aaa8ef 09-Sep-2011 Theodore Ts'o <tytso@mit.edu>

ext4: convert the free_blocks field in s_flex_groups to be free_clusters

Convert the free_blocks to be free_clusters to make the final revised
bigalloc changes easier to read/understand.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 57042651 09-Sep-2011 Theodore Ts'o <tytso@mit.edu>

ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter

Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 84130193 09-Sep-2011 Theodore Ts'o <tytso@mit.edu>

ext4: teach ext4_free_blocks() about bigalloc and clusters

The ext4_free_blocks() function now has two new flags that indicate
whether a partial cluster at the beginning or the end of the block
extents should be freed or not. That will be up to the caller (i.e.,
truncate), who can figure out whether partial clusters at the
beginning or the end of a block range can be freed.

We also have to update the ext4_mb_free_metadata() and
release_blocks_on_commit() machinery to be cluster-based, since it is
used by ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 53accfa9 09-Sep-2011 Theodore Ts'o <tytso@mit.edu>

ext4: teach mballoc preallocation code about bigalloc clusters

In most of mballoc.c, we do everything in units of clusters, since the
block allocation bitmaps and buddy bitmaps are all denominated in
clusters. The one place where we do deal with absolute block numbers
is in the code that handles the preallocation regions, since in the
case of inode-based preallocation regions, the start of the
preallocation region can't be relative to the beginning of the group.

So this adds a bit of complexity, where pa_pstart and pa_lstart are
block numbers, while pa_free, pa_len, and fe_len are denominated in
units of clusters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7137d7a4 09-Sep-2011 Theodore Ts'o <tytso@mit.edu>

ext4: convert instances of EXT4_BLOCKS_PER_GROUP to EXT4_CLUSTERS_PER_GROUP

Change the places in fs/ext4/mballoc.c where EXT4_BLOCKS_PER_GROUP are
used to indicate the number of bits in a block bitmap (which is really
a cluster allocation bitmap in bigalloc file systems). There are
still some places in the ext4 codebase where usage of
EXT4_BLOCKS_PER_GROUP needs to be audited/fixed, in code paths that
aren't used given the initial restricted assumptions for bigalloc.
These will need to be fixed before we can relax those restrictions.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 79a77c5a 01-Aug-2011 Yu Jian <yujian@whamcloud.com>

ext4: prevent memory leaks from ext4_mb_init_backend() on error path

In ext4_mb_init(), if the s_locality_group allocation fails it will
currently cause the allocations made in ext4_mb_init_backend() to
be leaked. Moving the ext4_mb_init_backend() allocation after the
s_locality_group allocation avoids that problem.

Signed-off-by: Yu Jian <yujian@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 48e6061b 01-Aug-2011 Yu Jian <yujian@whamcloud.com>

ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode #

Signed-off-by: Yu Jian <yujian@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9d8b9ec4 01-Aug-2011 Theodore Ts'o <tytso@mit.edu>

ext4: use ext4_msg() instead of printk in mballoc

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f18a5f21 01-Aug-2011 Theodore Ts'o <tytso@mit.edu>

ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c3e94d1d 26-Jul-2011 Yongqiang Yang <xiaoqiangnk@gmail.com>

ext4: let setup_new_group_blocks() set multiple bits at a time

Rename mb_set_bits() to ext4_set_bits() and make it a global function
so that setup_new_group_blocks() can use it.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4740b830 26-Jul-2011 Yongqiang Yang <xiaoqiangnk@gmail.com>

ext4: let ext4_group_add_blocks() handle 0 blocks quickly

If ext4_group_add_blocks() is called with 0 block, make it return 0
without doing any extra work.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# cc7365df 26-Jul-2011 Yongqiang Yang <xiaoqiangnk@gmail.com>

ext4: let ext4_group_add_blocks() return an error code

This patch lets ext4_group_add_blocks() return an error code if it
fails, so that upper functions can handle errors correctly.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0529155e 26-Jul-2011 Yongqiang Yang <xiaoqiangnk@gmail.com>

ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks()

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ced156e4 23-Jul-2011 Tao Ma <boyu.mt@taobao.com>

ext4: don't increment s_mb_buddies_generated in ext4_mb_release

In ext4_mb_release, we use s_mb_buddies_generated++. Although
the output is OK, I don't think we need this extra ++.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 529da704 23-Jul-2011 Tao Ma <boyu.mt@taobao.com>

ext4: remove unnecessary ext4_get_group_info in ext4_mb_load_buddy

ext4_mb_load_buddy() calls ext4_get_group_info() for setting both
"grp" and "e4b->bd_info", but it could do "e4b->bd_info = grp".

Reported-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# d7a1fee1 17-Jul-2011 Dan Ehrenberg <dehrenberg@google.com>

ext4: make the preallocation size be a multiple of stripe size

Previously, if a stripe width was provided, then it would be used
as the preallocation granularity, with no sanity checking and no
way to override this. Now, mb_prealloc_size defaults to the smallest
multiple of stripe size that is greater than or equal to the old
default mb_prealloc_size, and this can be overridden with the sysfs
interface.

Signed-off-by: Dan Ehrenberg <dehrenberg@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
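
The new default is just the old default rounded up to a stripe multiple.
A sketch (the values and the helper name are illustrative):

#include <stdio.h>

/* Smallest multiple of 'stripe' that is >= 'old_default'; with no stripe
 * configured the old default is kept as-is. */
static unsigned int prealloc_for_stripe(unsigned int old_default,
                                        unsigned int stripe)
{
        if (stripe == 0)
                return old_default;
        return ((old_default + stripe - 1) / stripe) * stripe;
}

int main(void)
{
        printf("%u\n", prealloc_for_stripe(512, 384)); /* -> 768 */
        printf("%u\n", prealloc_for_stripe(512, 0));   /* -> 512 */
        return 0;
}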


# caaf7a29 11-Jul-2011 Tao Ma <boyu.mt@taobao.com>

ext4: Fix a double free of sbi->s_group_info in ext4_mb_init_backend

If we meet with an error in ext4_mb_add_groupinfo, we kfree
sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)], but fail to
reset it to NULL. So the caller ext4_mb_init_backend will try to kfree
it again, causing a double free. So fix it by resetting it to NULL.

Some typos in the comments of mballoc.c are also fixed.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 823ba01f 11-Jul-2011 Tao Ma <boyu.mt@taobao.com>

ext4: fix a race which could leak memory in ext4_groupinfo_create_slab()

In ext4_groupinfo_create_slab, we create ext4_groupinfo_caches within
ext4_grpinfo_slab_create_mutex but set it outside the lock, and there
do exist some cases where we may create it twice, causing a memory
leak. So set it before we call mutex_unlock.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 22612283 10-Jul-2011 Tao Ma <boyu.mt@taobao.com>

ext4: Change the wrong param comment for ext4_trim_all_free

In the ext4_trim_all_free() comment, there is no longer an @e4b parameter;
instead it is @group.

Reported-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3d56b8d2 10-Jul-2011 Tao Ma <boyu.mt@taobao.com>

ext4: Speed up FITRIM by recording flags in ext4_group_info

In ext4, every time FITRIM is called, we iterate over all the
groups and trim them one by one. That is a bit of a waste of time if the
group has already been trimmed and there has been no change since the
last trim.

So this patch adds a new flag in ext4_group_info->bb_state to
indicate that the group has been trimmed, and it will be cleared
if some blocks are freed (in release_blocks_on_commit). Another field,
trim_minlen, is added in ext4_sb_info to record the last minlen
we used to trim the volume, so that if the caller provides a smaller
one, we will go on with the trim regardless of the bb_state.

A simple test with my intel x25m ssd:
df -h shows:
/dev/sdb1 40G 21G 17G 56% /mnt/ext4
Block size: 4096

run the FITRIM with the following parameter:
range.start = 0;
range.len = UINT64_MAX;
range.minlen = 1048576;

without the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.505s
user 0m0.000s
sys 0m1.224s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.359s
user 0m0.000s
sys 0m1.178s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.228s
user 0m0.000s
sys 0m1.151s

with the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.625s
user 0m0.000s
sys 0m1.269s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m0.002s
user 0m0.000s
sys 0m0.001s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m0.002s
user 0m0.000s
sys 0m0.001s

A big improvement for the 2nd and 3rd run.

Even after I delete some big image files, it is still much
faster than iterating the whole disk.

[root@boyu-tm test]# time ./ftrim /mnt/ext4/a
real 0m1.217s
user 0m0.000s
sys 0m0.196s

Cc: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b3d4c2b1 10-Jul-2011 Tao Ma <boyu.mt@taobao.com>

ext4: Add new ext4 trim tracepoints

Add ext4_trim_extent and ext4_trim_all_free.

Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 169ddc3e 10-Jul-2011 Tao Ma <boyu.mt@taobao.com>

ext4: speed up group trim with the right free block count

When we trim some free blocks in a group of ext4, we need to
calculate the free blocks properly and check whether there are
enough freed blocks left for us to trim. The current solution will
only count free spaces if they are large enough to trim, which
isn't appropriate.

Let us see a small example:
a group has 1.5M free, made up of 300k, 300k, 300k, 300k, 300k.
And minblocks is 1M. With the current solution, we have to iterate
the whole group since these 300k chunks will never be subtracted from
the 1.5M. But actually we should exit after we find the first 2
free spaces, since after we subtract the first 600K the remaining 3
chunks only sum up to 900K and they can't be trimmed.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 22f10457 10-Jul-2011 Tao Ma <boyu.mt@taobao.com>

ext4: fix trim length underflow with small trim length

In commit 0f0a25b, we adjust 'len' by s_first_data_block - start, but
this can underflow when blocksize=1K, fstrim_range.len=512 and
fstrim_range.start = 0. In that case, when we run the code
"len -= first_data_blk - start;", len underflows to -1ULL.
Although the later last_group check keeps us safe by limiting the trim
to the whole volume, that isn't what the user really wants.

So this patch fixes it. It also adds a check for 'start', as ext3 does, so that
we can break immediately if the start is invalid.
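
For illustration, a standalone sketch (plain C, not the ext4 code) of the
unsigned wrap described above:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t first_data_blk = 1;    /* blocksize = 1K => first data block is 1 */
    uint64_t start = 0;             /* fstrim_range.start = 0 */
    uint64_t len = 512 >> 10;       /* fstrim_range.len = 512 bytes => 0 blocks */

    len -= first_data_blk - start;  /* 0 - 1 wraps to 0xffffffffffffffff */
    printf("len = %" PRIu64 " (0x%" PRIx64 ")\n", len, len);
    return 0;
}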

Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7132de74 10-Jul-2011 Maxim Patlasov <maxim.patlasov@gmail.com>

ext4: fix i_blocks/quota accounting when extent insertion fails

The current implementation of ext4_free_blocks() always calls
dquot_free_block(). This looks quite sensible in most cases: blocks
to be freed are associated with an inode and were accounted in quota and
i_blocks some time ago.

However, there is a case where the blocks to be freed have not yet been
accounted by the time ext4_free_blocks() is called:

1. delalloc is on, write_begin pre-allocated some space in quota
2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks()
3. then ext4_ext_map_blocks() gets an error (e.g. ENOSPC) from
ext4_ext_insert_extent() and calls ext4_free_blocks().

In this scenario, ext4_free_blocks() calls dquot_free_block(), which, in
turn, decrements i_blocks for blocks which were not accounted yet (due
to delalloc). After a clean umount, e2fsck reports something like:

> Inode 21, i_blocks is 5080, should be 5128. Fix<y>?
because i_blocks was erroneously decremented as explained above.

The patch fixes the problem by passing the new flag
EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request
that the dquot_free_block() call be skipped.

Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 9331b626 28-Jun-2011 Yongqiang Yang <xiaoqiangnk@gmail.com>

ext4: quiet 'unused variables' compile warnings

Unused variables were deleted.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a9c667f8 06-Jun-2011 Lukas Czerner <lczerner@redhat.com>

ext4: fixed tracepoints cleanup

While creating fixed tracepoints for ext3, basically by porting them
from ext4, I found a lot of useless retyping, wrong type usage, useless
variable passing and other inconsistencies in the ext4 fixed tracepoint
code.

This patch cleans up the fixed tracepoint code for ext4 and also
simplifies some of the tracepoints.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5def1360 05-Jun-2011 Yongqiang Yang <xiaoqiangnk@gmail.com>

ext4: correct comments for ext4_free_blocks()

metadata is no longer a parameter of ext4_free_blocks().

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 55f020db 25-May-2011 Allison Henderson <achender@linux.vnet.ibm.com>

ext4: add flag to ext4_has_free_blocks

This patch adds an allocation request flag to the ext4_has_free_blocks
function which enables the use of reserved blocks. This will allow a
punch hole to proceed even if the disk is full. Punching a hole may
require additional blocks to first split the extents.

Because ext4_has_free_blocks is a low level function, the flag needs
to be passed down through several functions listed below:

ext4_ext_insert_extent
ext4_ext_create_new_leaf
ext4_ext_grow_indepth
ext4_ext_split
ext4_ext_new_meta_block
ext4_mb_new_blocks
ext4_claim_free_blocks
ext4_has_free_blocks

[ext4 punch hole patch series 1/5 v7]

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>


# 28739eea 24-May-2011 Lukas Czerner <lczerner@redhat.com>

ext4: protect bb_first_free in ext4_trim_all_free() with group lock

We should protect reading bd_info->bb_first_free with the group lock
because otherwise we might miss some free blocks. This is not a big deal
at all, but the change to do the right thing is really simple, so let's
do that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 78944086 24-May-2011 Lukas Czerner <lczerner@redhat.com>

ext4: only load buddy bitmap in ext4_trim_fs() when it is needed

Currently we load the buddy via ext4_mb_load_buddy() for every block
group we go through in ext4_trim_fs(), in many cases just to find
out that there is not enough free space to be bothered with. As Amir Goldstein
suggested, we can use the bb_free information directly from ext4_group_info.

This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and instead
gets the ext4_group_info via ext4_get_group_info() and uses its bb_free
information directly. This avoids an unnecessary buddy load when the
group does not have enough free space to trim.
Loading the buddy is now moved to ext4_trim_all_free().

Tested by me with xfstests 251.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 44183d42 09-May-2011 Amir Goldstein <amir73il@users.sf.net>

ext4: remove alloc_semp

After taking care of all group init races, all that remains is to
remove alloc_semp from ext4_allocation_context and ext4_buddy structs.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9b8b7d35 09-May-2011 Amir Goldstein <amir73il@users.sf.net>

ext4: teach ext4_mb_init_cache() to skip uptodate buddy caches

After online resize which adds new groups, some of the groups
in a buddy page may be initialized and uptodate, while other
(new ones) may be uninitialized.

The indication for init of new block groups is when ext4_mb_init_cache()
is called with an uptodate buddy page. In this case, initialized groups
on that buddy page must be skipped when initializing the buddy cache.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2de8807b 09-May-2011 Amir Goldstein <amir73il@users.sf.net>

ext4: synchronize ext4_mb_init_group() with buddy page lock

The old routines ext4_mb_[get|put]_buddy_cache_lock(), which used
to take grp->alloc_sem for all groups on the buddy page have been
replaced with the routines ext4_mb_[get|put]_buddy_page_lock().

The new routines take both buddy and bitmap page locks to protect
against concurrent init of groups on the same buddy page.

The GROUP_NEED_INIT flag is tested again under page lock to check
if the group was initialized by another caller.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e73a347b 09-May-2011 Amir Goldstein <amir73il@users.sf.net>

ext4: implement ext4_add_groupblocks() by freeing blocks

The old implementation used to take grp->alloc_sem and set the
GROUP_NEED_INIT flag, so that the buddy cache would be reloaded.

The new implementation updates the buddy cache by freeing the added
blocks and making them available for use, so there is no need to
reload the buddy cache and there is no need to take grp->alloc_sem.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2cd05cc3 09-May-2011 Theodore Ts'o <tytso@mit.edu>

ext4: remove unneeded ext4_journal_get_undo_access

The block allocation code used to use jbd2_journal_get_undo_access as
a way to make changes that wouldn't show up until the commit took
place. The new multi-block allocation code has its own way of
preventing newly freed blocks from getting reused until the commit
takes place (it avoids updating the buddy bitmaps until the commit is
done), so we don't need to use jbd2_journal_get_undo_access(), which
has extra overhead compared to jbd2_journal_get_write_access().

There was one last vestigial use of ext4_journal_get_undo_access() in
ext4_add_groupblocks(); change it to use ext4_journal_get_write_access()
and then remove the ext4_journal_get_undo_access() support.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2846e820 09-May-2011 Amir Goldstein <amir73il@users.sf.net>

ext4: move ext4_add_groupblocks() to mballoc.c

In preparation for the next patch, the function ext4_add_groupblocks()
is moved to mballoc.c, where it could use some static functions.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# d9f34504 30-Apr-2011 Theodore Ts'o <tytso@mit.edu>

ext4: ignore errors when issuing discards

This is an effective revert of commit a30eec2a8: "ext4: stop issuing
discards if not supported by device". The problem is that there are
some devices that may return errors in response to a discard request
some times but not others. (One example would be a hybrid dm device
which concatenates an SSD and an HDD device).

By this logic, I also removed the error checking from ext4's FITRIM
code, so that an error from a discard will not stop the FITRIM from
trying to trim the rest of the file system.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 26626f11 16-Apr-2011 Yang Ruirui <ruirui.r.yang@tieto.com>

ext4: release page cache in ext4_mb_load_buddy error path

Add missing page_cache_release in the error path of ext4_mb_load_buddy

Signed-off-by: Yang Ruirui <ruirui.r.yang@tieto.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# 0ba08517 23-Mar-2011 Tao Ma <boyu.mt@taobao.com>

ext4: fix a BUG in mb_mark_used during trim.

In a bs=4096 volume, if we call FITRIM with the following parameter as
fstrim_range(start = 102400, len = 134144000, minlen = 10240),
we will trigger this BUG_ON:

BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));

Mar 4 00:55:52 boyu-tm kernel: ------------[ cut here ]------------
Mar 4 00:55:52 boyu-tm kernel: kernel BUG at fs/ext4/mballoc.c:1506!
Mar 4 01:21:09 boyu-tm kernel: Code: d4 00 00 00 00 49 89 fe 8b 56 0c 44 8b 7e 04 89 55 c4 48 8b 4f 28 89 d6 44 01 fe 48 63 d6 48 8b 41 18 48 c1 e0 03 48 39 c2 76 04 <0f> 0b eb fe 48 8b 55 b0 8b 47 34 3b 42 08 74 04 0f 0b eb fe 48
Mar 4 01:21:09 boyu-tm kernel: RIP [<ffffffffa053eb42>] mb_mark_used+0x47/0x26c [ext4]
Mar 4 01:21:09 boyu-tm kernel: RSP <ffff880121e45c38>
Mar 4 01:21:09 boyu-tm kernel: ---[ end trace 9f461696f6a9dcf2 ]---

Fix this bug by doing the accounting correctly.

Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4596fe07 21-Mar-2011 Eric Sandeen <sandeen@redhat.com>

ext4: don't kfree uninitialized s_group_info members

We can call kfree on uninitialized members of the s_group_info array
on an error path. We can avoid this by kzalloc'ing the array.

This doesn't entirely solve the oops on mount if we fail down this
path; failed_mount4: frees the sbi, for one, which gets referenced
later in the failed mount paths - I haven't worked that out yet.

https://bugzilla.kernel.org/show_bug.cgi?id=30872

Reported-by: Eugene A. Shatokhin <dame_eugene@mail.ru>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4dd89fc6 27-Feb-2011 Theodore Ts'o <tytso@mit.edu>

ext4: suppress verbose debugging information if malloc-debug is off

If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due
to disk being full, a verbose debugging message is printed, even if
the malloc-debug switch has not been enabled. Suppress the debugging
message so that nothing is printed unless malloc-debug has been turned
on.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5a54b2f1 24-Feb-2011 Coly Li <colyli@gmail.com>

ext4: mballoc: don't replace the current preallocation group unnecessarily

In ext4_mb_check_group_pa(), the current preallocation space is
replaced with a new preallocation space when the two have the same
distance from the goal block.

This doesn't actually gain us anything, so change things so that the
function only switches to the new preallocation group if its distance
from the goal block is strictly smaller than the current preallocation
group's distance from the goal block.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7c786059 24-Feb-2011 Coly Li <i@coly.li>

mballoc: add comments to ext4_mb_mark_free_simple()

This patch adds comments to ext4_mb_mark_free_simple to make it more
understandable.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>


# 235772da 24-Feb-2011 Coly Li <i@coly.li>

ext4: remove unncessary call mb_find_buddy() in debugging code

In __mb_check_buddy(), look at the code below:
591 fstart = -1;
592 buddy = mb_find_buddy(e4b, 0, &max);
593 for (i = 0; i < max; i++) {
594 if (!mb_test_bit(i, buddy)) {
595 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
596 if (fstart == -1) {
597 fragments++;
598 fstart = i;
599 }
600 continue;
601 }
602 fstart = -1;
603 /* check used bits only */
604 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
605 buddy2 = mb_find_buddy(e4b, j, &max2);
606 k = i >> j;
607 MB_CHECK_ASSERT(k < max2);
608 MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
609 }
610 }
611 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
612 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
613
614 grp = ext4_get_group_info(sb, e4b->bd_group);
615 buddy = mb_find_buddy(e4b, 0, &max);

On line 592, buddy is fetched by mb_find_buddy() with order 0. Between
lines 593 and 615, buddy is not changed, so there is no need to fetch
it again with order 0 on line 615.

We can safely remove the second mb_find_buddy() on line 615.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>


# 84b775a3 23-Feb-2011 Coly Li <i@coly.li>

ext4: code cleanup in mb_find_buddy()

The current code calculates max regardless of whether order is zero, which
is unnecessary. This cleanup patch sets max to "1 << (e4b->bd_blkbits
+ 3)" only when order == 0.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>


# 0b75a840 22-Feb-2011 Lukas Czerner <lczerner@redhat.com>

ext4: mark file-local functions and variables as static

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 2892c15d 12-Feb-2011 Eric Sandeen <sandeen@redhat.com>

ext4: make grpinfo slab cache names static

In 2.6.37 I was running into oopses with repeated module
loads & unloads. I tracked this down to:

fb1813f4 ext4: use dedicated slab caches for group_info structures

(this was in addition to the features advert unload problem)

The kstrdup & subsequent kfree of the cache name was causing
a double free. In slub, at least, if I read it right it allocates
& frees the name itself, slab seems to do something different...
so in slub I think we were leaking -our- cachep->name, and double
freeing the one allocated by slub.

After getting lost in slab/slub/slob a bit, I just looked at other
sized-caches that get allocated. jbd2, biovec, sgpool all do it
more or less the way jbd2 does. Below patch follows the jbd2
method of dynamically allocating a cache at mount time from
a list of static names.

(This might also possibly fix a race creating the caches with
parallel mounts running).

[Folded in a fix from Dan Carpenter which fixed an off-by-one error in
the original patch]

Cc: stable@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0f0a25bf 11-Jan-2011 Jan Kara <jack@suse.cz>

ext4: fix trimming starting with block 0 with small blocksize

When s_first_data_block is not zero (which happens e.g. when the block size is 1KB)
and the trim ioctl is called to start trimming from block 0, the math in
ext4_get_group_no_and_offset() overflows. The overall result is that the ioctl
returns EINVAL, which is unexpected, and we probably don't want
userspace tools to have to bother with internal details of the filesystem structure.
So just silently increase the starting offset (and shorten the length) when the starting
block is below s_first_data_block.
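
A minimal standalone sketch (not the ext4 code) of the adjustment described
above; the guard on 'len' reflects the wrap-around discussed in the 22f10457
entry earlier in this log:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t first_data_block = 1;   /* 1K block size => block 0 is not data */
    uint64_t start = 0, len = 1000;  /* requested trim range, in blocks */

    if (start < first_data_block) {
        uint64_t shift = first_data_block - start;
        start += shift;
        /* avoid the unsigned wrap noted in the 22f10457 entry above */
        len = (len > shift) ? len - shift : 0;
    }
    printf("adjusted start=%llu len=%llu\n",
           (unsigned long long)start, (unsigned long long)len);
    return 0;
}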

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0a2179b1 11-Jan-2011 Theodore Ts'o <tytso@mit.edu>

ext4: revert buggy trim overflow patch

This reverts commit 4f531501e44: ext4: fix possible overflow in
ext4_trim_fs()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a5196f8c 09-Jan-2011 Theodore Ts'o <tytso@mit.edu>

ext4: remove ext4_mb_return_to_preallocation()

This function was never implemented, except for a BUG_ON which was
tripping when ext4 is run without a journal. The problem is that
although the comment asserts that "truncate (which is the only way to
free block) discards all preallocations", ext4_free_blocks() is also
called in various error recovery paths when blocks have been
allocated, but for various reasons, we were not able to use those data
blocks (for example, because we ran out of memory while trying to
manipulate the extent tree, or some other similar situation).

In addition to the fact that this function isn't implemented except
for the incorrect BUG_ON, the single caller of this function,
ext4_free_blocks(), doesn't use it at all if the journal is enabled.

So remove the (stub) function entirely for now. If we decide it's
better to add it back, it's only going to be useful with a relatively
large number of code changes anyway.

Google-Bug-Id: 3236408

Cc: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ca6e909f 09-Jan-2011 Jan Kara <jack@suse.cz>

ext4: fix trimming of a single group

When ext4_trim_fs() is called to trim a part of a single group, the
logic will wrongly set last block of the interval to 'len' instead
of 'first_block + len'. Thus a shorter interval is possibly trimmed.
Fix it.

CC: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f2321097 09-Jan-2011 Theodore Ts'o <tytso@mit.edu>

ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED

Remove the short element i_delalloc_reserved_flag from the
ext4_inode_info structure and replace it with a new bit in i_state_flags.
Since we have an ext4_inode_info for every ext4 inode cached in the
inode cache, any savings we can produce here is a very good thing from
a memory utilization perspective.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 93259636 09-Jan-2011 Lukas Czerner <lczerner@redhat.com>

ext4: remove warning message from ext4_issue_discard helper

ext4_issue_discard() is supposed to be a helper for issuing discards; however,
if the underlying device does not support discard, it prints out
a warning message and clears the DISCARD flag in the mount options. Since it
can be (and is) used by others, it should not do any of that and should let the
caller handle the error case.

This commit removes the warning message and flag setting from
ext4_issue_discard() and does them only where really needed
(release_blocks_on_commit). The FITRIM ioctl should not set any flags nor
should it print warning messages, so get rid of the warning there as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>


# 4f531501 09-Jan-2011 Lukas Czerner <lczerner@redhat.com>

ext4: fix possible overflow in ext4_trim_fs()

When determining the last group through ext4_get_group_no_and_offset() the
result may be wrong in cases where range->start and range->len are too
big, because summing those two numbers up may overflow.

Fix that by checking range->len and limiting its value to
ext4_blocks_count(). I tested this commit and got the expected
result.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>


# b72143ab 20-Dec-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Add error checking to kmem_cache_alloc() call in ext4_free_blocks()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# fd8c37ec 15-Dec-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Simplify the usage of clear_opt() and set_opt() macros

Change clear_opt() and set_opt() to take a superblock pointer instead
of a pointer to EXT4_SB(sb)->s_mount_opt. This makes it easier for us
to support a second mount option field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b56ff9d3 08-Nov-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Don't call sb_issue_discard() in ext4_free_blocks()

Commit 5c521830cf (ext4: Support discard requests when running in
no-journal mode) attempts to add sb_issue_discard() for data blocks
(in data=writeback mode) and in no-journal mode. Unfortunately, this
no longer works, because in commit dd3932eddf (block: remove
BLKDEV_IFL_WAIT), sb_issue_discard() only presents a synchronous
interface, and there are times when we call ext4_free_blocks() when we
are holding a spinlock, or are otherwise in an atomic context.

For now, I've removed the call to sb_issue_discard() to prevent a
deadlock or (if spinlock debugging is enabled) failures like this:

BUG: scheduling while atomic: rc.sysinit/1376/0x00000002
Pid: 1376, comm: rc.sysinit Not tainted 2.6.36-ARCH #1
Call Trace:
[<ffffffff810397ce>] __schedule_bug+0x5e/0x70
[<ffffffff81403110>] schedule+0x950/0xa70
[<ffffffff81060bad>] ? insert_work+0x7d/0x90
[<ffffffff81060fbd>] ? queue_work_on+0x1d/0x30
[<ffffffff81061127>] ? queue_work+0x37/0x60
[<ffffffff8140377d>] schedule_timeout+0x21d/0x360
[<ffffffff812031c3>] ? generic_make_request+0x2c3/0x540
[<ffffffff81402680>] wait_for_common+0xc0/0x150
[<ffffffff81041490>] ? default_wake_function+0x0/0x10
[<ffffffff812034bc>] ? submit_bio+0x7c/0x100
[<ffffffff810680a0>] ? wake_bit_function+0x0/0x40
[<ffffffff814027b8>] wait_for_completion+0x18/0x20
[<ffffffff8120a969>] blkdev_issue_discard+0x1b9/0x210
[<ffffffff811ba03e>] ext4_free_blocks+0x68e/0xb60
[<ffffffff811b1650>] ? __ext4_handle_dirty_metadata+0x110/0x120
[<ffffffff811b098c>] ext4_ext_truncate+0x8cc/0xa70
[<ffffffff810d713e>] ? pagevec_lookup+0x1e/0x30
[<ffffffff81191618>] ext4_truncate+0x178/0x5d0
[<ffffffff810eacbb>] ? unmap_mapping_range+0xab/0x280
[<ffffffff810d8976>] vmtruncate+0x56/0x70
[<ffffffff811925cb>] ext4_setattr+0x14b/0x460
[<ffffffff811319e4>] notify_change+0x194/0x380
[<ffffffff81117f80>] do_truncate+0x60/0x90
[<ffffffff811e08fa>] ? security_inode_permission+0x1a/0x20
[<ffffffff811eaec1>] ? tomoyo_path_truncate+0x11/0x20
[<ffffffff81127539>] do_last+0x5d9/0x770
[<ffffffff811278bd>] do_filp_open+0x1ed/0x680
[<ffffffff8140644f>] ? page_fault+0x1f/0x30
[<ffffffff81132bfc>] ? alloc_fd+0xec/0x140
[<ffffffff81118db1>] do_sys_open+0x61/0x120
[<ffffffff81118e8b>] sys_open+0x1b/0x20
[<ffffffff81002e6b>] system_call_fastpath+0x16/0x1b

https://bugzilla.kernel.org/show_bug.cgi?id=22302

Reported-by: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: jiayingz@google.com


# eee4adc7 27-Oct-2010 Eric Sandeen <sandeen@redhat.com>

ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static

These functions are only used within fs/ext4/mballoc.c, so move them
so they are used after they are defined, and then make them static.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5dabfc78 27-Oct-2010 Theodore Ts'o <tytso@mit.edu>

ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()

This is a cleanup to avoid namespace leaks out of fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7360d173 27-Oct-2010 Lukas Czerner <lczerner@redhat.com>

ext4: Add batched discard support for ext4

Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It searches for free extents in the allocation groups specified by the
byte range start -> start+len. When a free extent is within this range,
its blocks are marked as used and then trimmed. Afterwards these blocks
are marked as free again in the per-group bitmap.

Since fstrim is a long operation it is good to have the ability to
interrupt it with a signal. This was added by Dmitry Monakhov.
Thanks Dmitry.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 77ca6cdf 27-Oct-2010 Lukas Czerner <lczerner@redhat.com>

ext4: Use return value from sb_issue_discard()

Use the return value from sb_issue_discard() as the return value of
ext4_issue_discard(). Since sb_issue_discard() may result in more
serious errors than just -EOPNOTSUPP, it is worth informing callers of
this function about them so that error cases can be handled properly.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 87783690 27-Oct-2010 Namhyung Kim <namhyung@gmail.com>

ext4: Check return value of sb_getblk() and friends

Fail block allocation if sb_getblk() returns NULL. In that case,
sb_find_get_block() is also likely to fail, so we should skip
calling ext4_forget().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 16828088 27-Oct-2010 Theodore Ts'o <tytso@mit.edu>

ext4: use KMEM_CACHE instead of kmem_cache_create

Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
cache. This slab tends to be fairly static, so it shouldn't be marked
as likely to have free pages that can be reclaimed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3e1e5f50 27-Oct-2010 Eric Sandeen <sandeen@redhat.com>

ext4: don't use ext4_allocation_contexts for tracing

Many tracepoints were populating an ext4_allocation_context
to pass in, but this requires a slab allocation even when
tracepoints are off. In fact, 4 of 5 of these allocations
were only for tracing. In addition, we were only using a
small fraction of the 144 bytes of this structure for this
purpose.

We can do away with all these alloc/frees of the ac and
simply pass in the bits we care about, instead.

I tested this by turning on tracing and running through
xfstests on x86_64. I did not actually do anything with
the trace output, however.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 53fdcf99 27-Oct-2010 Lukas Czerner <lczerner@redhat.com>

ext4: don't hold spinlock while calling ext4_issue_discard()

We can't hold the block group spinlock because ext4_issue_discard()
calls wait and hence can get rescheduled.

Google-Bug-Id: 3017678

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 58298709 27-Oct-2010 Lukas Czerner <lczerner@redhat.com>

ext4: check for negative error code from sb_issue_discard

sb_issue_discard() returns a negative error code, so check for
-EOPNOTSUPP.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# fb1813f4 27-Oct-2010 Curt Wohlgemuth <curtw@google.com>

ext4: use dedicated slab caches for group_info structures

ext4_group_info structures are currently allocated with kmalloc().
With a typical 4K block size, these are 136 bytes each -- meaning
they'll each consume a 256-byte slab object. On a system with many
ext4 large partitions, that's a lot of wasted kernel slab space.
(E.g., a single 1TB partition will have about 8000 block groups, using
about 2MB of slab, of which nearly 1MB is wasted.)

This patch creates an array of slab pointers created as needed --
depending on the superblock block size -- and uses these slabs to
allocate the group info objects.

Google-Bug-Id: 2980809

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 85fe4025 23-Oct-2010 Christoph Hellwig <hch@lst.de>

fs: do not assign default i_ino in new_inode

Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now the set of callers that need it is estimated conservatively; that is,
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# dd3932ed 16-Sep-2010 Christoph Hellwig <hch@lst.de>

block: remove BLKDEV_IFL_WAIT

All the blkdev_issue_* helpers can only sanely be used by synchronous
callers. To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout. For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 61002f7db 18-Aug-2010 Christoph Hellwig <hch@infradead.org>

ext4: do not send discards as barriers

ext4 already uses synchronous discards, no need to add I/O barriers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 2cf6d26a 18-Aug-2010 Christoph Hellwig <hch@infradead.org>

block: pass gfp_mask and flags to sb_issue_discard

We'll need to get rid of the BLKDEV_IFL_BARRIER flag, and to facilitate
that and to make the interface less confusing pass all flags explicitly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 6c7a120a 05-Aug-2010 Aditya Kali <adityakali@google.com>

ext4: Adding error check after calling ext4_mb_regular_allocator()

If the bitmap block on disk is bad, ext4_mb_load_buddy() returns an
error. This error is returned to the caller,
ext4_mb_regular_allocator() and then to ext4_mb_new_blocks(). But
ext4_mb_new_blocks() did not check for the return value of
ext4_mb_regular_allocator() and would repeatedly try to load the
bitmap block. The fix simply catches the return value and exits out of
the 'repeat' loop after cleanup.

We also take the opportunity to clean up the error handling in
ext4_mb_new_blocks().

Google-Bug-Id: 2853530

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 73b2c716 30-Jul-2010 Uwe Kleine-König <u.kleine-koenig@pengutronix.de>

fix comment typo "choosed" -> "chosen"

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# e3570639 27-Jul-2010 Eric Sandeen <sandeen@redhat.com>

ext4: don't print scary messages for allocation failures post-abort

I often get emails containing the "This should not happen!!" message,
conveniently trimmed to remove things like:

sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 03 13 c9 70 00 00 28 00
end_request: I/O error, dev sda, sector 51628400
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only

I don't think there is any value to the verbosity if the reason is
due to a filesystem abort; it just obfuscates the root cause.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 506bf2d8 27-Jul-2010 Eric Sandeen <sandeen@redhat.com>

ext4: allocate stripe-multiple IOs on stripe boundaries

For some reason, today mballoc only allocates IOs which are exactly
stripe-sized on a stripe boundary. If you have a multiple (say, a
128k IO on a 64k stripe) you may end up unaligned.

It seems to me that a simple change to align stripe-multiple IOs
on stripe boundaries would be a very good idea, unless this breaks
some other mballoc heuristic for some reason...
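
As a rough standalone illustration of the condition change being suggested
(plain C with made-up function names, not the mballoc code):

#include <stdio.h>

/* Old behaviour: only an exact stripe-sized request is stripe-aligned.
 * Suggested behaviour: any non-zero multiple of the stripe size is. */
static int stripe_align_old(unsigned long len, unsigned long stripe)
{
    return len == stripe;
}

static int stripe_align_new(unsigned long len, unsigned long stripe)
{
    return stripe && len && (len % stripe == 0);
}

int main(void)
{
    unsigned long stripe = 16;  /* e.g. a 64k stripe with 4k blocks */

    /* a 128k request (32 blocks) */
    printf("128k request: old=%d new=%d\n",
           stripe_align_old(32, stripe), stripe_align_new(32, stripe));
    return 0;
}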

Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5c521830 27-Jul-2010 Jiaying Zhang <jiayingz@google.com>

ext4: Support discard requests when running in no-journal mode

Issue discard request in ext4_free_blocks() when ext4 has no journal and
is mounted with discard option.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a271fe85 27-Jul-2010 Joe Perches <joe@perches.com>

ext4: Remove unnecessary casts of private_data

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e5880d76 27-Jul-2010 Theodore Ts'o <tytso@mit.edu>

ext4: fix potential NULL dereference while tracing

The allocation_context pointer can be NULL.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e29136f8 28-Jun-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Enhance ext4_grp_locked_error() to take block and function numbers

Also use a macro definition so that __func__ and __LINE__ are implicit.
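
A standalone sketch of that macro technique (illustrative names, not the
ext4 macro itself): every call site passes __func__ and __LINE__ without
spelling them out.

#include <stdio.h>

static void grp_error(const char *func, unsigned int line, const char *msg)
{
    fprintf(stderr, "error in %s:%u: %s\n", func, line, msg);
}

/* callers just write GRP_ERROR("..."); the location comes for free */
#define GRP_ERROR(msg) grp_error(__func__, __LINE__, (msg))

int main(void)
{
    GRP_ERROR("example message");
    return 0;
}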

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5a0790c2 14-Jun-2010 Andi Kleen <andi@firstfloor.org>

ext4: remove initialized but not read variables

No real bugs found, just removed some dead code.

Found by gcc 4.6's new warnings.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a0375156 11-Jun-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Clean up s_dirt handling

We don't need to set s_dirt in most of the ext4 code when journaling
is enabled. In ext3/4 some of the summary statistics for # of free
inodes, blocks, and directories are calculated from the per-block
group statistics when the file system is mounted or unmounted. As a
result the superblock doesn't have to be updated, either via the
journal or by setting s_dirt. There are a few exceptions, most
notably when resizing the file system, where the superblock needs to
be modified --- and in that case it should be done as a journalled
operation if possible, and s_dirt set only in no-journal mode.

This patch will optimize out some unneeded disk writes when using ext4
with a journal.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 60e6679e 17-May-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Drop whitespace at end of lines

This patch was generated using:

#!/usr/bin/perl -i
while (<>) {
    s/[ \t]+$//;
    print;
}

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f307333e 17-May-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Add new tracepoints to track mballoc's buddy bitmap loads

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 12e9b892 16-May-2010 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: Use bitops to read/modify i_flags in struct ext4_inode_info

At several places we modify EXT4_I(inode)->i_flags without holding
i_mutex (ext4_do_update_inode, ...). These modifications are racy and
we can lose updates to i_flags. So convert handling of i_flags to use
bitops which are atomic.

https://bugzilla.kernel.org/show_bug.cgi?id=15792

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 291dae47 16-May-2010 Curt Wohlgemuth <curtw@google.com>

ext4: Fix for ext4_mb_collect_stats()

Fix ext4_mb_collect_stats() to use the correct test for s_bal_success; it
should be testing "best-extent.fe_len >= orig-extent.fe_len", not
"orig-extent.fe_len >= goal-extent.fe_len".

Signed-off-by: Curt Wohlgemuth <curtw@google.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 8a57d9d6 16-May-2010 Curt Wohlgemuth <curtw@google.com>

ext4: check for a good block group before loading buddy pages

This adds a new field in ext4_group_info to cache the largest available
block range in a block group, and avoids loading the buddy pages until *after*
we've done a sanity check on the block group.

With large allocation requests (e.g., fallocate(), 8MiB) and relatively full
partitions, it's easy to have no block groups with a block extent large
enough to satisfy the input request length. This currently causes the loop
during cr == 0 in ext4_mb_regular_allocator() to load the buddy bitmap pages
for EVERY block group. That can be a lot of pages. The patch below allows
us to call ext4_mb_good_group() BEFORE we load the buddy pages (although we
have to check again after we lock the block group).

Addresses-Google-Bug: #2578108
Addresses-Google-Bug: #2704453

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a30eec2a 16-May-2010 Eric Sandeen <sandeen@redhat.com>

ext4: stop issuing discards if not supported by device

Turn off issuance of discard requests if the device does
not support it - similar to the action we take for barriers.
This will save a little computation time if a non-discardable
device is mounted with -o discard, and also makes it obvious
that it's not doing what was asked at mount time ...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e39e07fd 13-May-2010 Jing Zhang <zj.barak@gmail.com>

ext4: rename ext4_mb_release_desc() to ext4_mb_unload_buddy()

This function cleans up after ext4_mb_load_buddy(), so the renaming
makes the code clearer.

Signed-off-by: Jing Zhang <zj.barak@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 62e823a2 12-May-2010 Jing Zhang <zj.barak@gmail.com>

ext4: Remove unnecessary call to ext4_get_group_desc() in mballoc

Signed-off-by: Jing Zhang <zj.barak@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b90f6870 20-Apr-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Issue the discard operation *before* releasing the blocks to be reused

Otherwise, we can end up having data corruption because the blocks
could get reused and then discarded!

https://bugzilla.kernel.org/show_bug.cgi?id=15579

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 64e290ec 04-Mar-2010 Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>

ext4: fix up rb_root initializations to use RB_ROOT

ext4 uses rb_node = NULL; to zero rb_root in a few places. Using
RB_ROOT as the initializer is more portable in case the underlying
implementation of rbtrees changes in the future.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Paris <eparis@redhat.com>


# 5dd4056d 03-Mar-2010 Christoph Hellwig <hch@infradead.org>

dquot: cleanup space allocation / freeing routines

Get rid of the alloc_space, free_space, reserve_space, claim_space and
release_rsv dquot operations - they are always called from the filesystem
and if a filesystem really needs their own (which none currently does)
it can just call into its own routine directly.

Move shared logic into the common __dquot_alloc_space,
dquot_claim_space_nodirty and __dquot_free_space low-level methods,
and rationalize the wrappers around it to move as much code as possible
into the common block for CONFIG_QUOTA vs not. Also rename
all these helpers to be named dquot_* instead of vfs_dq_*.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# bda00de7 03-Mar-2010 Akinobu Mita <akinobu.mita@gmail.com>

ext4: cleanup to use ext4_grp_offs_to_block()

More cleanup to convert open-coded calculations of the first block
number of a free extent to use ext4_grp_offs_to_block() instead.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>


# 5661bd68 03-Mar-2010 Akinobu Mita <akinobu.mita@gmail.com>

ext4: cleanup to use ext4_group_first_block_no()

This is a cleanup and simplification patch which takes some open-coded
calculations to calculate the first block number of a group and
converts them to use the (already defined) ext4_group_first_block_no()
function.
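
For illustration, a standalone sketch (made-up constants, not ext4 code) of
the calculation this helper is described as replacing: the first block of
group g is g * blocks_per_group + the first data block.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t blocks_per_group = 32768;   /* typical for a 4K block size */
    uint64_t first_data_block = 0;       /* 1 on 1K-block filesystems */
    unsigned int group = 3;

    uint64_t first_block = (uint64_t)group * blocks_per_group + first_data_block;
    printf("group %u starts at block %llu\n", group,
           (unsigned long long)first_block);
    return 0;
}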

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>


# cc483f10 01-Mar-2010 Tao Ma <tao.ma@oracle.com>

ext4: Fix fencepost error in choosing group vs file preallocation.

The ext4 multiblock allocator decides whether to use group or file
preallocation based on the file size. When the file size reaches
s_mb_stream_request (default is 16 blocks), it changes to use a
file-specific preallocation. This is cool, but it has a tiny problem.

See a simple script:
mkfs.ext4 -b 1024 /dev/sda8 1000000
mount -t ext4 -o nodelalloc /dev/sda8 /mnt/ext4
for((i=0;i<5;i++))
do
cat /mnt/4096>>/mnt/ext4/a #4096 is a file with 4096 characters.
cat /mnt/4096>>/mnt/ext4/b
done
debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1

And you get
BLOCKS:
(0-14):8705-8719, (15):2356, (16-19):8465-8468

So there are 3 extents, which is a bit strange for the lonely 15th logical
block. As we write the 16th block, we choose file preallocation in
ext4_mb_group_or_file, but in ext4_mb_normalize_request we meet the
16*1024 range, so no preallocation is carried out. File b then
reserves the space after '2356', so when we write block 16, we start from
another part.

This patch just changes the check in ext4_mb_group_or_file, so
that for the lonely 15th block we will still use group preallocation.
After the patch, we will get:
debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1
BLOCKS:
(0-15):8705-8720, (16-19):8465-8468

Looks more sane. Thanks.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 12062ddd 15-Feb-2010 Eric Sandeen <sandeen@redhat.com>

ext4: move __func__ into a macro for ext4_warning, ext4_error

Just a pet peeve of mine; we had a mishmash of calls with either __func__
or "function_name" and the latter tends to get out of sync.

I think it's easier to just hide the __func__ in a macro, and it'll
be consistent from then on.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1537a363 29-Jan-2010 Daniel Mack <daniel@caiaq.de>

tree-wide: fix 'lenght' typo in comments and code

Some misspelled occurrences of 'octet' and some comments were also fixed
while I was at it.

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# d21cd8f1 09-Dec-2009 Dmitry Monakhov <dmonakhov@openvz.org>

ext4: Fix potential quota deadlock

We have to delay vfs_dq_claim_space() until allocation context destruction.
Currently we have the following call trace:
ext4_mb_new_blocks()
/* task is already holding ac->alloc_semp */
->ext4_mb_mark_diskspace_used
->vfs_dq_claim_space() /* acquire dqptr_sem here. Possible deadlock */
->ext4_mb_release_context() /* drop ac->alloc_semp here */

Let's move quota claiming to ext4_da_update_reserve_space()

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-rc7 #18
-------------------------------------------------------
write-truncate-/3465 is trying to acquire lock:
(&s->s_dquot.dqptr_sem){++++..}, at: [<c025e73b>] dquot_claim_space+0x3b/0x1b0

but task is already holding lock:
(&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&meta_group_info[i]->alloc_sem){++++..}:
[<c017d04b>] __lock_acquire+0xd7b/0x1260
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c0527191>] down_read+0x51/0x90
[<c02ce962>] ext4_mb_load_buddy+0xb2/0x370
[<c02d0c1c>] ext4_mb_free_blocks+0x46c/0x870
[<c029c9d3>] ext4_free_blocks+0x73/0x130
[<c02c8cfc>] ext4_ext_truncate+0x76c/0x8d0
[<c02a8087>] ext4_truncate+0x187/0x5e0
[<c01e0f7b>] vmtruncate+0x6b/0x70
[<c022ec02>] inode_setattr+0x62/0x190
[<c02a2d7a>] ext4_setattr+0x25a/0x370
[<c022ee81>] notify_change+0x151/0x340
[<c021349d>] do_truncate+0x6d/0xa0
[<c0221034>] may_open+0x1d4/0x200
[<c022412b>] do_filp_open+0x1eb/0x910
[<c021244d>] do_sys_open+0x6d/0x140
[<c021258e>] sys_open+0x2e/0x40
[<c0103100>] sysenter_do_call+0x12/0x32

-> #2 (&ei->i_data_sem){++++..}:
[<c017d04b>] __lock_acquire+0xd7b/0x1260
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c0527191>] down_read+0x51/0x90
[<c02a5787>] ext4_get_blocks+0x47/0x450
[<c02a74c1>] ext4_getblk+0x61/0x1d0
[<c02a7a7f>] ext4_bread+0x1f/0xa0
[<c02bcddc>] ext4_quota_write+0x12c/0x310
[<c0262d23>] qtree_write_dquot+0x93/0x120
[<c0261708>] v2_write_dquot+0x28/0x30
[<c025d3fb>] dquot_commit+0xab/0xf0
[<c02be977>] ext4_write_dquot+0x77/0x90
[<c02be9bf>] ext4_mark_dquot_dirty+0x2f/0x50
[<c025e321>] dquot_alloc_inode+0x101/0x180
[<c029fec2>] ext4_new_inode+0x602/0xf00
[<c02ad789>] ext4_create+0x89/0x150
[<c0221ff2>] vfs_create+0xa2/0xc0
[<c02246e7>] do_filp_open+0x7a7/0x910
[<c021244d>] do_sys_open+0x6d/0x140
[<c021258e>] sys_open+0x2e/0x40
[<c0103100>] sysenter_do_call+0x12/0x32

-> #1 (&sb->s_type->i_mutex_key#7/4){+.+...}:
[<c017d04b>] __lock_acquire+0xd7b/0x1260
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c0526505>] mutex_lock_nested+0x65/0x2d0
[<c0260c9d>] vfs_load_quota_inode+0x4bd/0x5a0
[<c02610af>] vfs_quota_on_path+0x5f/0x70
[<c02bc812>] ext4_quota_on+0x112/0x190
[<c026345a>] sys_quotactl+0x44a/0x8a0
[<c0103100>] sysenter_do_call+0x12/0x32

-> #0 (&s->s_dquot.dqptr_sem){++++..}:
[<c017d361>] __lock_acquire+0x1091/0x1260
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c0527191>] down_read+0x51/0x90
[<c025e73b>] dquot_claim_space+0x3b/0x1b0
[<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380
[<c02d210a>] ext4_mb_new_blocks+0x34a/0x530
[<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0
[<c02a5966>] ext4_get_blocks+0x226/0x450
[<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0
[<c02a6ed6>] ext4_da_writepages+0x506/0x790
[<c01de272>] do_writepages+0x22/0x50
[<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80
[<c01d7b9b>] filemap_flush+0x2b/0x30
[<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60
[<c029e595>] ext4_release_file+0x75/0xb0
[<c0216b59>] __fput+0xf9/0x210
[<c0216c97>] fput+0x27/0x30
[<c02122dc>] filp_close+0x4c/0x80
[<c014510e>] put_files_struct+0x6e/0xd0
[<c01451b7>] exit_files+0x47/0x60
[<c0146a24>] do_exit+0x144/0x710
[<c0147028>] do_group_exit+0x38/0xa0
[<c0159abc>] get_signal_to_deliver+0x2ac/0x410
[<c0102849>] do_notify_resume+0xb9/0x890
[<c01032d2>] work_notifysig+0x13/0x21

other info that might help us debug this:

3 locks held by write-truncate-/3465:
#0: (jbd2_handle){+.+...}, at: [<c02e1f8f>] start_this_handle+0x38f/0x5c0
#1: (&ei->i_data_sem){++++..}, at: [<c02a57f6>] ext4_get_blocks+0xb6/0x450
#2: (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370

stack backtrace:
Pid: 3465, comm: write-truncate- Not tainted 2.6.32-rc7 #18
Call Trace:
[<c0524cb3>] ? printk+0x1d/0x22
[<c017ac9a>] print_circular_bug+0xca/0xd0
[<c017d361>] __lock_acquire+0x1091/0x1260
[<c016bca2>] ? sched_clock_local+0xd2/0x170
[<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c025e73b>] ? dquot_claim_space+0x3b/0x1b0
[<c0527191>] down_read+0x51/0x90
[<c025e73b>] ? dquot_claim_space+0x3b/0x1b0
[<c025e73b>] dquot_claim_space+0x3b/0x1b0
[<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380
[<c02d210a>] ext4_mb_new_blocks+0x34a/0x530
[<c02c601d>] ? ext4_ext_find_extent+0x25d/0x280
[<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0
[<c016bca2>] ? sched_clock_local+0xd2/0x170
[<c016be60>] ? sched_clock_cpu+0x120/0x160
[<c016beef>] ? cpu_clock+0x4f/0x60
[<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
[<c052712c>] ? down_write+0x8c/0xa0
[<c02a5966>] ext4_get_blocks+0x226/0x450
[<c016be60>] ? sched_clock_cpu+0x120/0x160
[<c016beef>] ? cpu_clock+0x4f/0x60
[<c017908b>] ? trace_hardirqs_off+0xb/0x10
[<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0
[<c01d69cc>] ? find_get_pages_tag+0x16c/0x180
[<c01d6860>] ? find_get_pages_tag+0x0/0x180
[<c02a73bd>] ? __mpage_da_writepage+0x16d/0x1a0
[<c01dfc4e>] ? pagevec_lookup_tag+0x2e/0x40
[<c01ddf1b>] ? write_cache_pages+0xdb/0x3d0
[<c02a7250>] ? __mpage_da_writepage+0x0/0x1a0
[<c02a6ed6>] ext4_da_writepages+0x506/0x790
[<c016beef>] ? cpu_clock+0x4f/0x60
[<c016bca2>] ? sched_clock_local+0xd2/0x170
[<c016be60>] ? sched_clock_cpu+0x120/0x160
[<c016be60>] ? sched_clock_cpu+0x120/0x160
[<c02a69d0>] ? ext4_da_writepages+0x0/0x790
[<c01de272>] do_writepages+0x22/0x50
[<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80
[<c01d7b9b>] filemap_flush+0x2b/0x30
[<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60
[<c029e595>] ext4_release_file+0x75/0xb0
[<c0216b59>] __fput+0xf9/0x210
[<c0216c97>] fput+0x27/0x30
[<c02122dc>] filp_close+0x4c/0x80
[<c014510e>] put_files_struct+0x6e/0xd0
[<c01451b7>] exit_files+0x47/0x60
[<c0146a24>] do_exit+0x144/0x710
[<c017b163>] ? lock_release_holdtime+0x33/0x210
[<c0528137>] ? _spin_unlock_irq+0x27/0x30
[<c0147028>] do_group_exit+0x38/0xa0
[<c017babb>] ? trace_hardirqs_on+0xb/0x10
[<c0159abc>] get_signal_to_deliver+0x2ac/0x410
[<c0102849>] do_notify_resume+0xb9/0x890
[<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
[<c017b163>] ? lock_release_holdtime+0x33/0x210
[<c0165b50>] ? autoremove_wake_function+0x0/0x50
[<c017ba54>] ? trace_hardirqs_on_caller+0x134/0x190
[<c017babb>] ? trace_hardirqs_on+0xb/0x10
[<c0300ba4>] ? security_file_permission+0x14/0x20
[<c0215761>] ? vfs_write+0x131/0x190
[<c0214f50>] ? do_sync_write+0x0/0x120
[<c0103115>] ? sysenter_do_call+0x27/0x32
[<c01032d2>] work_notifysig+0x13/0x21

CC: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>


# 1f2acb60 22-Jan-2010 Theodore Ts'o <tytso@mit.edu>

ext4: Add block validity check when truncating indirect block mapped inodes

Add checks to ext4_free_branches() to make sure a block number found
in an indirect block is valid before trying to free it. If a bad
block number is found, stop freeing the indirect block immediately,
since the file system is corrupt and we will need to run fsck anyway.
This also avoids spamming the logs, and specifically avoids
driver-level "attempt to access beyond end of device" errors that
obscure what is really going on.

If you get *really*, *really*, *really* unlucky, without this patch, a
supposed indirect block containing garbage might contain a reference
to a primary block group descriptor, in which case
ext4_free_branches() could end up zero'ing out a block group
descriptor block, and if then one of the block bitmaps for a block
group described by that bg descriptor block is not in memory, and is
read in by ext4_read_block_bitmap(). This function calls
ext4_valid_block_bitmap(), which assumes that bg_inode_table() was
validated at mount time and hasn't been modified since. Since this
assumption is no longer valid, it's possible for the value
(ext4_inode_table(sb, desc) - group_first_block) to go negative, which
will cause ext4_find_next_zero_bit() to trigger a kernel GPF.

Addresses-Google-Bug: #2220436

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b844167e 08-Dec-2009 Curt Wohlgemuth <curtw@google.com>

ext4: remove blocks from inode prealloc list on failure

This fixes a leak of blocks in an inode prealloc list if device failures
cause ext4_mb_mark_diskspace_used() to fail.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# af901ca1 14-Nov-2009 André Goddard Rosa <andre.goddard@gmail.com>

tree-wide: fix assorted typos all over the place

That is "success", "unknown", "through", "performance", "[re|un]mapping"
, "access", "default", "reasonable", "[con]currently", "temperature"
, "channel", "[un]used", "application", "example","hierarchy", "therefore"
, "[over|under]flow", "contiguous", "threshold", "enough" and others.

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 9084d471 22-Nov-2009 Theodore Ts'o <tytso@mit.edu>

ext4: use ext4_data_block_valid() in ext4_free_blocks()

The block validity framework does a more comprehensive set of checks,
and it saves object code space to use ext4_data_block_valid() rather
than the limited open-coded version that had been in ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e6362609 23-Nov-2009 Theodore Ts'o <tytso@mit.edu>

ext4: call ext4_forget() from ext4_free_blocks()

Add the facility for ext4_forget() to be called from
ext4_free_blocks(). This simplifies the code in a large number of
places, and centralizes most of the work of calling ext4_forget() into
a single place.

Also fix a bug in the extents migration code; it wasn't calling
ext4_forget() when releasing the indirect blocks during the
conversion. As a result, if the system crashed during or shortly after
the extents migration, and the released indirect blocks get reused as
data blocks, the journal replay would corrupt the data blocks. With
this new patch, fixing this bug was as simple as adding the
EXT4_FREE_BLOCKS_FORGET flags to the call to ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>


# 44338711 22-Nov-2009 Theodore Ts'o <tytso@mit.edu>

ext4: fold ext4_free_blocks() and ext4_mb_free_blocks()

ext4_mb_free_blocks() is only called by ext4_free_blocks(), and the
latter function doesn't really do much. So merge the two functions
together, such that ext4_free_blocks() is now found in
fs/ext4/mballoc.c. This saves about 200 bytes of compiled text space.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5328e635 19-Nov-2009 Eric Sandeen <sandeen@redhat.com>

ext4: make trim/discard optional (and off by default)

It is anticipated that when sb_issue_discard starts doing
real work on trim-capable devices, we may see issues. Make
this mount-time optional, and default it to off until we know
that things are working out OK.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ca0c9584 03-Oct-2009 Christoph Lameter <cl@linux-foundation.org>

this_cpu: Straight transformations

Use this_cpu_ptr and __this_cpu_ptr in locations where straight
transformations are possible because per_cpu_ptr is used with
either smp_processor_id() or raw_smp_processor_id().

cc: David Howells <dhowells@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
cc: Ingo Molnar <mingo@elte.hu>
cc: Rusty Russell <rusty@rustcorp.com.au>
cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 296c355c 29-Sep-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Use tracepoints for mb_history trace file

The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.

By ripping out the mb_history code and replacing it with ftrace
tracepoints, we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 90576c0b 29-Sep-2009 Theodore Ts'o <tytso@mit.edu>

ext4, jbd2: Drop unneeded printks at mount and unmount time

There are a number of kernel printk's which are printed when an ext4
filesystem is mounted and unmounted. Disable them to economize space
in the system logs. In addition, disabling the mballoc stats by
default saves a number of unneeded atomic operations for every block
allocation or deallocation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 71780577 27-Sep-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Fix heuristic which avoids group preallocation for closed files

The heuristic was designed to avoid using locality group preallocation
when writing the last segment of a closed file. Fix it by moving the
code that sets size to the maximum of size and isize until after we
check whether size == isize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# fb0a387d 16-Sep-2009 Eric Sandeen <sandeen@redhat.com>

ext4: limit block allocations for indirect-block files to < 2^32

Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.

This patch limits such allocations to < 2^32, and adds
BUG_ONs if we do get blocks larger than that.

This should address RH Bug 519471, ext4 bitmap allocator
must limit blocks to < 2^32

* ext4_find_goal() is modified to choose a goal < UINT_MAX,
so that our starting point is in an acceptable range.

* ext4_xattr_block_set() is modified such that the goal block
is < UINT_MAX, as above.

* ext4_mb_regular_allocator() is modified so that the group
search does not continue into groups which are too high

* ext4_mb_use_preallocated() has a check that we don't use
preallocated space which is too far out

* ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs

No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes. Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.

For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 08c3a813 09-Sep-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Clarify the locking details in mballoc

We don't need to take the alloc_sem lock when we are adding new
groups, since mballoc won't see the new group added until we bump
sbi->s_groups_count.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


# f41c0750 09-Sep-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: check for need init flag in ext4_mb_load_buddy

We should check for need init flag with the group's alloc_sem held, to
make sure while we are loading the buddy cache and holding a reference
to it, a file system resize can't add new blocks to same group.

The patch also drops the need init flag check in
ext4_mb_regular_allocator() because doing the check without holding
alloc_sem is racy.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


# b6a758ec 09-Sep-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: move ext4_mb_init_group() function earlier in the mballoc.c

This moves the function around so that it can be called from
ext4_mb_load_buddy().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7f1346a9 05-Sep-2009 Tobias Klauser <tklauser@distanz.ch>

ext4: Declare seq_operations and file_operations structures as const

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a36b4498 25-Aug-2009 Eric Sandeen <sandeen@redhat.com>

ext4: use ext4_grpblk_t more extensively

unsigned short is potentially too small to track blocks within
a group; today it is safe due to restrictions in e2fsprogs but
we have _lo / _hi bits for group blocks with the intent to go
up to 32 bits, so clean this up now.

There are many more places where we use unsigned/int/unsigned int
to contain a group block but this should at least fix all the
short types.

I added a few comments to the struct ext4_group_info definition
as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1927805e 25-Aug-2009 Eric Sandeen <sandeen@redhat.com>

ext4: use variables not types in sizeofs() for allocations

Precursor to changing some types; to keep things in sync, it
seems better to allocate/memset based on the size of the
variables we are using rather than on some disconnected
basic type like "unsigned short"

Signed-off-by: Eric Sandeen <sandeen@redhat.com>


# 38877f4e 17-Aug-2009 Eric Sandeen <sandeen@redhat.com>

simplify some logic in ext4_mb_normalize_request

While reading through some of the mballoc code it seems that a couple
spots in the size normalization function could be streamlined.

The test for non-overlapping PAs can be or'd for the start & end
conditions, and the tests for adjacent PAs can be else-if'd -
it's essentially independently testing:

if (A + B <= C)
...
if (A > C)
...

These cannot both be true so it seems like the else-if might
be slightly more efficient and/or informative.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0373130d 17-Aug-2009 Eric Sandeen <sandeen@redhat.com>

ext4: open-code ext4_mb_update_group_info

ext4_mb_update_group_info is only called in one place, and it's
extremely simple. There's no reason to have it in a separate function
in a separate file as far as I can tell, it just obfuscates what's
really going on.

Perhaps it was intended to keep the grp->bb_* manipulation local to
mballoc.c but we're already accessing other grp-> fields in balloc.c
directly so this seems ok.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 50797481 18-Sep-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Avoid group preallocation for closed files

Currently the group preallocation code tries to find a large (512)
free block from which to do per-cpu group allocation for small files.
The problem with this scheme is that it leaves the filesystem horribly
fragmented. In the worst case, if the filesystem is unmounted and
remounted (after a system shutdown, for example) we forget the fact
that we were using a particular (now-partially filled) 512 block
extent. So the next time we try to allocate space for a small file,
we will find *another* completely free 512 block chunk to allocate
small files. Given that there are 32,768 blocks in a block group,
after 64 iterations of "mount, write one 4k file in a directory,
unmount", the block group will have 64 files, each separated by 511
blocks, and the block group will no longer have any completely free
512-block chunks left for group preallocation space.

So if we try to allocate blocks for a file that has been closed, such
that we know the final size of the file, and the filesystem is not
busy, avoid using group preallocation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4ba74d00 09-Aug-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Fix bugs in mballoc's stream allocation mode

The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all
screwed up. These fields were getting set unconditionally all the
time, even when stream allocation had not taken place, and they were
being used even when the file was smaller than s_mb_stream_request,
which is when the allocation should _not_ be doing stream allocation.

Fix this by determining once, in ext4_mb_group_or_file(), whether or
not stream allocation should take place, and setting a flag which
gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found().
This simplifies the code and assures that we are consistently using
(or not using) the stream allocation logic.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0ef90db9 09-Aug-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Display the mballoc flags in mb_history in hex instead of decimal

Displaying the flags in base 16 makes it easier to see which flags
have been set.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6ba495e9 18-Sep-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Add configurable run-time mballoc debugging

Allow mballoc debugging to be enabled at run-time instead of just at
compile time.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 833576b3 13-Jul-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Fix ext4_mb_initialize_context() to initialize all fields

Pavel Roskin pointed out that kmemcheck indicated that
ext4_mb_store_history() was accessing uninitialized values of
ac->ac_tail and ac->ac_buddy leading to garbage in the mballoc
history. Fix this by initializing the entire structure to all zeros
first.

Also, two fields were getting doubly initialized by the caller of
ext4_mb_initialize_context, so remove them for efficiency's sake.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1c718505 05-Jul-2009 Akira Fujita <a-fujita@rs.jp.nec.com>

ext4: Fix compile warnings with MB_DEBUG

When MB_DEBUG is enabled, we get some compile warnings because
ext4_group_t is unsigned int. This patch fixes them.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5a4a7989 05-Jul-2009 Joe Perches <joe@perches.com>

ext4: Remove unnecessary semicolons in mballoc.c

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 024eab4d 17-Jul-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix memory leak when mounting an ext4 filesystem

The allocation of the ext4_group_info array was moved to a new
function ext4_mb_add_group_info() in commit 5f21b0e6 so that online
resize would use a common (and correct) codepath. Unfortunately, the
call to the new ext4_mb_add_group_info() function was added without
removing the code which originally allocated the array. This caused a
memory leak each time an ext4 filesystem was mounted.

The fix is simple; remove the code that did the original allocation,
since it is no longer needed.

Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0610b6e9 15-Jun-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Fix 64-bit block type problem on 32-bit platforms

The function ext4_mb_free_blocks() was using an "unsigned long" to
pass a block number; this will cause 64-bit block numbers to get
truncated on x86 and other 32-bit platforms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>


# 3e03f9ca 05-Jul-2009 Jesper Dangaard Brouer <hawk@comx.dk>

ext4: Use rcu_barrier() on module unload.

The ext4 module uses call_rcu(), thus it should use rcu_barrier() on
module unload.

The kmem cache ext4_pspace_cachep is sometimes freed using
call_rcu() callbacks. Thus, we must wait for completion of call_rcu()
before doing kmem_cache_destroy().

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 089ceecc 05-Jul-2009 Eric Sandeen <sandeen@redhat.com>

ext4: mark several more functions in mballoc.c as noinline

Ted noticed a stack-deep callchain through
writepages->ext4_mb_regular_allocator->ext4_mb_init_cache->submit_bh ...

With all the static functions in mballoc.c, gcc helpfully
inlines for us, and we get something like this:

ext4_mb_regular_allocator (232 bytes stack)
ext4_mb_init_cache (232 bytes stack)
submit_bh (starts 464 deeper)

the 2 ext4 functions here get several others inlined; by telling
gcc not to inline them, we can save stack space for when we
head off into submit_bh land and associated block layer callchains.
The following noinlined functions are only called once, so this
won't impact any other callchains:

ext4_mb_regular_allocator (104) (was 232)
ext4_mb_find_by_goal (56) (noinlined)
ext4_mb_init_group (24) (noinlined)
ext4_mb_init_cache (136) (was 232)
ext4_mb_generate_buddy (88) (noinlined)
ext4_mb_generate_from_pa (40) (noinlined)
submit_bh
ext4_mb_simple_scan_group (24) (noinlined)
ext4_mb_scan_aligned (56) (noinlined)
ext4_mb_complex_scan_group (40) (noinlined)
ext4_mb_try_best_found (24) (noinlined)

now when we head off into submit_bh() we're only 264 bytes deeper
in stack than when we entered ext4_mb_regular_allocator()
(vs. 464 bytes before). Every 200 bytes helps. :)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6fd058f7 17-May-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Add a comprehensive block validity check to ext4_get_blocks()

To catch filesystem bugs or corruption which could lead to the
filesystem getting severely damaged, this patch adds a facility for
tracking all of the filesystem metadata blocks by contiguous regions
in a red-black tree. This allows quick searching of the tree to
locate extents which might overlap with filesystem metadata blocks.

This facility is also used by the multi-block allocator to assure that
it is not allocating blocks out of the system zone, as well as by the
routines used when reading indirect blocks and extents information
from disk to make sure their contents are valid.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# bc8e6740 15-May-2009 Vincent Minet <vincent@vincent-minet.net>

ext4: Fix spinlock assertions on UP systems

On UP systems without DEBUG_SPINLOCK, ext4_is_group_locked always fails
which triggers a BUG_ON() call.
This patch fixes it by using assert_spin_locked instead.

Signed-off-by: Vincent Minet <vincent@vincent-minet.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 955ce5f5 02-May-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Convert ext4_lock_group to use sb_bgl_lock

We have sb_bgl_lock() and the ext4_group_info.bb_state
bit spinlock to protect group information. The latter is only
used within the mballoc code. Consolidate them to use sb_bgl_lock().
This makes the mballoc.c code much simpler and also avoids
confusion with two locks protecting the same info.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f4033903 01-May-2009 Curt Wohlgemuth <curtw@google.com>

ext4: Make the length of the mb_history file tunable

In memory-constrained systems with many partitions, the ~68K for each
partition for the mb_history buffer can be excessive.

This patch adds a new mount option, mb_history_length, as well as a
way of setting the default via a module parameter (or via a sysfs
parameter in /sys/module/ext4/parameters/default_mb_history_length).
If the mb_history_length is set to zero, the mb_history facility is
disabled entirely.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 75507efb 30-Apr-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Don't avoid using BLOCK_UNINIT block groups in mballoc

By avoiding the use of not-yet-used block groups (i.e., block groups
with the BLOCK_UNINIT flag), mballoc had a tendency to create large
files with large non-contiguous gaps. In addition avoiding the use of
new block groups had a tendency to push regular file data into the
first block group in a flex_bg group, which slows down the speed of
e2fsck pass 2, since it has a tendency to seek much more. For
example:

                 Before Patch                   After Patch
              Time in seconds                Time in seconds
            Real /  User /  Sys   MB/s     Real /  User /  Sys    MB/s
Pass 1      8.52 /  2.21 / 0.46  20.43     8.84 /  4.97 / 1.11   19.68
Pass 2     21.16 /  1.02 / 1.86  11.30     6.54 /  1.77 / 1.78   36.39
Pass 3      0.01 /  0.00 / 0.00 139.00     0.01 /  0.01 / 0.00  128.90
Pass 4      0.16 /  0.15 / 0.00   0.00     0.17 /  0.17 / 0.00    0.00
Pass 5      2.52 /  1.99 / 0.09   0.79     2.31 /  1.78 / 0.06    0.86
Total      32.40 /  5.11 / 2.49  12.81    17.99 /  8.75 / 2.98   23.01

This was on a sample 80 gig root filesystem which was approximately
50% full. Note the improved e2fsck pass 2 performance, by over a
factor of 3, due to a decreased number of seeks. (The total amount of
I/O in pass 2 was unchanged; the layout of the directory blocks was
simply much better from e2fsck's perspective.)

Other changes as a result of this patch on this sample filesystem:

Before Patch After Patch
# of non-contig files 762 779
# of non-contig directories 571 570
# of BLOCK_UNINIT bg's 307 293
# of INODE_UNINIT bg's 503 503

Out of 640 block groups, of which 333 were in use, this patch caused
an extra 14 block groups to be utilized. The number of non-contiguous
files did go up slightly, but when measured against the 99.9% of the
files (603,154) which were contiguously allocated, this is pretty
insignificant.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andreas Dilger <adilger@sun.com>


# 9bffad1e 17-Jun-2009 Theodore Ts'o <tytso@mit.edu>

ext4: convert instrumentation from markers to tracepoints

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 8df9675f 01-May-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Avoid races caused by on-line resizing and SMP memory reordering

Ext4's on-line resizing adds a new block group and then, only at the
last step adjusts s_groups_count. However, it's possible on SMP
systems that another CPU could see the updated s_groups_count and
not see the newly initialized data structures for the just-added block
group. For this reason, it's important to insert an SMP read barrier
after reading s_groups_count and before reading (for example) any of
the new block group descriptors allowed by the increased value of
s_groups_count.

Unfortunately, we rather blatantly violate this locking protocol
documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes
happen relatively rarely, and (2) it seems rare that the filesystem
code will immediately try to use a just-added block group before any
memory ordering issues resolve themselves. So apparently problems
here are relatively hard to hit, since ext3 has been vulnerable to the
same issue for years with no one apparently complaining.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e7c9e3e9 27-Mar-2009 Theodore Ts'o <tytso@mit.edu>

ext4: fix locking typo in mballoc which could cause soft lockup hangs

Smatch (http://repo.or.cz/w/smatch.git/) complains about the locking in
ext4_mb_add_n_trim() from fs/ext4/mballoc.c

4438 list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
4439 pa_inode_list) {
4440 spin_lock(&tmp_pa->pa_lock);
4441 if (tmp_pa->pa_deleted) {
4442 spin_unlock(&pa->pa_lock);
4443 continue;
4444 }

Brown paper bag time...

Reported-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
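
Based on the excerpt above, the fix is presumably just to unlock the
same pa_lock that was taken (a sketch, not the actual diff):

	list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
				pa_inode_list) {
		spin_lock(&tmp_pa->pa_lock);
		if (tmp_pa->pa_deleted) {
			spin_unlock(&tmp_pa->pa_lock);	/* was: &pa->pa_lock */
			continue;
		}
		/* ... rest of the loop body unchanged ... */
		spin_unlock(&tmp_pa->pa_lock);
	}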


# a7b19448 27-Mar-2009 Dan Carpenter <error27@gmail.com>

ext4: fix typo which causes a memory leak on error path

This was found by smatch (http://repo.or.cz/w/smatch.git/)

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# cc0fb9ad 27-Mar-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Rename pa_linear to pa_type

Impact: code cleanup

This patch renames pa_linear to pa_type and adds MB_INODE_PA
and MB_GROUP_PA to indicate inode and group prealloc space.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a269eb18 26-Jan-2009 Jan Kara <jack@suse.cz>

ext4: Use lowercase names of quota functions

Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mingming Cao <cmm@us.ibm.com>
CC: linux-ext4@vger.kernel.org


# 60e58e0f 22-Jan-2009 Mingming Cao <cmm@us.ibm.com>

ext4: quota reservation for delayed allocation

Uses quota reservation/claim/release to handle quota properly for delayed
allocation in the three steps: 1) quotas are reserved when data being copied
to cache when block allocation is defered 2) when new blocks are allocated.
reserved quotas are converted to the real allocated quota, 2) over-booked
quotas for metadata blocks are released back.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>


# d33a1976 16-Mar-2009 Eric Sandeen <sandeen@redhat.com>

ext4: fix bb_prealloc_list corruption due to wrong group locking

This is for Red Hat bug 490026: EXT4 panic, list corruption in
ext4_mb_new_inode_pa

ext4_lock_group(sb, group) is supposed to protect this list for
each group, and a common code flow to remove a prealloc space is like
this:

ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
ext4_lock_group(sb, grp);
list_del(&pa->pa_group_list);
ext4_unlock_group(sb, grp);

so it's critical that we get the right group number back for
this prealloc context, to lock the right group (the one
associated with this pa) and prevent concurrent list manipulation.

However, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a
comment, "-1 is to protect from crossing allocation group".

This makes sense for the group_pa, where pa_pstart is advanced
by the length which has been used (in ext4_mb_release_context()),
and when the entire length has been used, pa_pstart has been
advanced to the first block of the next group.

However, for inode_pa, pa_pstart is never advanced; it's just
set once to the first block in the group and not moved after
that. So in this case, if we subtract one in ext4_mb_put_pa(),
we are actually locking the *previous* group, and opening the
race with the other threads which do not subtract off the extra
block.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 8d03c7a0 14-Mar-2009 Eric Sandeen <sandeen@redhat.com>

ext4: fix bogus BUG_ONs in mballoc code

Thiemo Nagel reported that:

# dd if=/dev/zero of=image.ext4 bs=1M count=2
# mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
-O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
# mount -o loop image.ext4 mnt/
# dd if=/dev/zero of=mnt/file

oopsed, with a BUG_ON in ext4_mb_normalize_request because
size == EXT4_BLOCKS_PER_GROUP

It appears to me (esp. after talking to Andreas) that the BUG_ON
is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
be allowed, though larger sizes do indicate a problem.

Fix that and another (apparently rare) codepath with a similar check.

Reported-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9f24e420 04-Mar-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Use atomic_t's in struct flex_groups

Reduce pressure on the sb_bgl_lock family of locks by using atomic_t's
to track the number of free blocks and inodes in each flex_group.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b713a5ec 31-Mar-2009 Theodore Ts'o <tytso@mit.edu>

ext4: remove /proc tuning knobs

Remove tuning knobs in /proc/fs/ext4/<dev>/* since they have been
replaced by knobs in sysfs at /sys/fs/ext4/<dev>/*.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a4912123 11-Mar-2009 Theodore Ts'o <tytso@mit.edu>

ext4: New inode/block allocation algorithms for flex_bg filesystems

The find_group_flex() inode allocator is now only used if the
filesystem is mounted using the "oldalloc" mount option. It is
replaced with the original Orlov allocator that has been updated for
flex_bg filesystems (it should behave the same way if flex_bg is
disabled). The inode allocator now functions by taking into account
each flex_bg group, instead of each block group, when deciding whether
or not it's time to allocate a new directory into a fresh flex_bg.

The block allocator has also been changed so that the first block
group in each flex_bg is preferred for use for storing directory
blocks. This keeps directory blocks close together, which is good for
speeding up e2fsck since large directories are more likely to look
like this:

debugfs: stat /home/tytso/Maildir/cur
Inode: 1844562 Type: directory Mode: 0700 Flags: 0x81000
Generation: 1132745781 Version: 0x00000000:0000ad71
User: 15806 Group: 15806 Size: 1060864
File ACL: 0 Directory ACL: 0
Links: 2 Blockcount: 2072
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009
atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009
mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009
crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009
Size of extra inode fields: 28
BLOCKS:
(0):7348651, (1-258):7348654-7348911
TOTAL: 259

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# d794bf8e 14-Feb-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Initialize preallocation list_head's properly

When creating a new ext4_prealloc_space structure, we have to
initialize its list_head pointers before we add them to any prealloc
lists. Otherwise, with list debug enabled, we will get list
corruption warnings.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ba443916 10-Feb-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix lockdep warning

We should not call ext4_mb_add_n_trim while holding alloc_semp.

=============================================
[ INFO: possible recursive locking detected ]
2.6.29-rc4-git1-dirty #124
---------------------------------------------
ffsb/3116 is trying to acquire lock:
(&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>]
ext4_mb_load_buddy+0xd2/0x343

but task is already holding lock:
(&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>]
ext4_mb_load_buddy+0xd2/0x343

http://bugzilla.kernel.org/show_bug.cgi?id=12672

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9fd9784c 26-Jan-2009 Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>

ext4: Fix building with EXT4FS_DEBUG

When bg_free_blocks_count was renamed to bg_free_blocks_count_lo in
560671a0, its uses under EXT4FS_DEBUG were not changed to the helper
ext4_free_blks_count.

Another commit, 498e5f24, also did not change everything needed under
EXT4FS_DEBUG, thus making it spill some warnings related to printing
format.

This commit fixes both issues and makes ext4 build again when
EXT4FS_DEBUG is enabled.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ba80b101 03-Jan-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Add markers for better debuggability

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0087d9fb 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc

With nodelalloc option we need to update the dirty block counter on
block allocation failure. This is needed because we increment the
dirty block counter early in the block allocation phase. Without
the patch, s_dirty_blocks_counter goes wrong, so the filesystem's
free block count decreases incorrectly.

Tested-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 29eaf024 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Init the complete page while building buddy cache

We need to init the complete page during buddy cache init
by setting the contents to '1'. Otherwise we can see the
following errors after doing an online resize of the
filesystem:

EXT4-fs error (device sdb1): ext4_mb_mark_diskspace_used:
Allocating block 1040385 in system zone of 127 group

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 8556e8f3 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Don't allow new groups to be added during block allocation

After we mark the blocks in the buddy cache as allocated,
we need to ensure that we don't reinit the buddy cache until
the block bitmap is updated. This commit achieves this by holding
the group_info alloc_semaphore till ext4_mb_release_context

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 648f5879 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: mark the blocks/inode bitmap beyond end of group as used

We need to mark the block/inode bitmap beyond the end of the group
with '1'.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
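
An intentionally naive sketch of the idea (the real code can mark the
tail of the bitmap a byte at a time rather than bit by bit):

	/* everything past the last valid block of the group is "in use" */
	for (bit = EXT4_BLOCKS_PER_GROUP(sb); bit < sb->s_blocksize * 8; bit++)
		ext4_set_bit(bit, bitmap_bh->b_data);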


# 2ccb5fb9 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Use new buffer_head flag to check uninit group bitmaps initialization

For uninit block group, the on-disk bitmap is not initialized. That
implies we cannot depend on the uptodate flag on the bitmap
buffer_head to find bitmap validity. Use a new buffer_head flag which
would be set after we properly initialize the bitmap. This also
prevents (re-)initializing the uninit group bitmap every time we call
ext4_read_block_bitmap().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# 3300beda 03-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: code cleanup

Rename some variables. We also unlock locks in the reverse order we
acquired them, as a part of cleanup.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 560671a0 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Use high 16 bits of the block group descriptor's free counts fields

Rename the lower bits with suffix _lo and add helper
to access the values. Also rename bg_itable_unused_hi
to bg_pad as in e2fsprogs.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e8134b27 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix race between read_block_bitmap() and mark_diskspace_used()

We need to make sure we update the block bitmap and clear
EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held, since
ext4_read_block_bitmap() looks at EXT4_BG_BLOCK_UNINIT to decide
whether to initialize the block bitmap each time it is called
(introduced by commit c806e68f), and this can race with block
allocations in ext4_mb_mark_diskspace_used().

ext4_read_block_bitmap does:

spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh, block_group, desc);

Now on the block allocation side we do

mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
....
spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);

ie on allocation we update the bitmap then we take the sb_bgl_lock
and clear the EXT4_BG_BLOCK_UNINIT flag. What can happen is a
parallel ext4_read_block_bitmap can zero out the bitmap in between
the above mb_set_bits and spin_lock(sb_bgl_lock(...)).

The race results in below user visible errors
EXT4-fs error (device sdb1): ext4_mb_release_inode_pa: free 100, pa_free 105
EXT4-fs error (device sdb1): mb_free_blocks: double-free of inode 0's block ..

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
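
The shape of the fix, as described above, is to perform both steps
under the one sb_bgl_lock (a sketch; how mb_set_bits' lock argument is
handled is a detail of the actual patch):

	spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
	mb_set_bits(NULL, bitmap_bh->b_data,
		    ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
	spin_unlock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));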


# 5d1b1b3f 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: fix BUG when calling ext4_error with locked block group

The mballoc code likes to call ext4_error while it is holding locked
block groups. This can cause a scheduling in atomic context BUG. We
can't just unlock the block group and relock it after/if ext4_error
returns since that might result in race conditions in the case where
the filesystem is set to continue after finding errors.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b7be019e 23-Nov-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix lockdep recursive locking warning

In ext4_mb_init_group(), if the filesystem block size is less than
PAGE_SIZE/2, the code tries to grab alloc_sem for multiple block
groups in a loop. We need to allow for this by using
down_write_nested() and passing in the loop index as a lock subclass
number. This works because no other code path needs to take multiple
alloc_sem's. Note that lockdep will fail for filesystem blocksizes
smaller than PAGE_SIZE/16. (e.g., a 1k filesystem blocksize with
a 32k page size, or a 2k filesystem blocksize with a 64k page size,
etc.)

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7a2fcbf7 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: don't use blocks freed but not yet committed in buddy cache init

When we generate buddy cache (especially during resize) we need to
make sure we don't use the blocks freed but not yet committed. This
makes sure we have the right value of free blocks count in the group
info and also in the bitmap. This also ensures ordered mode
consistency.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org


# c3a326a6 25-Nov-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: cleanup mballoc header files

Move some of the forward declarations of the static functions
to mballoc.c where they are used. This enables us to include
mballoc.h in other .c files. Also correct the buddy cache
documentation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 920313a7 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize

The new groups added during resize are flagged as
need_init groups. Make sure we properly initialize these
groups. When we have block size < page size and we are adding
new groups, the page may still be marked uptodate even though
we haven't initialized the group. While forcing the init
of the buddy cache we need to make sure that other groups which are
part of the same buddy cache page are not using the cache.
group_info->alloc_sem is added to ensure the same.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
cc: stable@kernel.org


# 3a06d778 22-Nov-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: sparse fixes

* Change EXT4_HAS_*_FEATURE to return a boolean
* Add a function prototype for ext4_fiemap() in ext4.h
* Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions
* Add lock annotations to mb_free_blocks()

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 498e5f24 04-Nov-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Change unsigned long to unsigned int

Convert the unsigned longs that are most responsible for bloating the
stack usage on 64-bit systems.

Nearly all places in the ext3/4 code which use "unsigned long" are
probably bugs, since on 32-bit systems a ulong is only 32 bits anyway,
which means we are wasting stack space on 64-bit systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a9df9a49 05-Jan-2009 Theodore Ts'o <tytso@mit.edu>

ext4: Make ext4_group_t be an unsigned int

Nearly all places in the ext3/4 code which use "unsigned long" are
probably bugs, since on 32-bit systems a ulong is only 32 bits anyway,
which means we are wasting stack space on 64-bit systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 032115fc 05-Jan-2009 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Don't overwrite allocation_context ac_status

We can call ext4_mb_check_limits even after successfully allocating
the requested blocks. In that case, make sure we don't overwrite
ac_status if it already has the status AC_STATUS_FOUND. This fixes
the lockdep warning:

=============================================
[ INFO: possible recursive locking detected ]
2.6.28-rc6-autokern1 #1
---------------------------------------------
fsstress/11948 is trying to acquire lock:
(&meta_group_info[i]->alloc_sem){----}, at: [<c04d9a49>] ext4_mb_load_buddy+0x9f/0x278
.....

stack backtrace:
.....
[<c04db974>] ext4_mb_regular_allocator+0xbb5/0xd44
.....

but task is already holding lock:
(&meta_group_info[i]->alloc_sem){----}, at: [<c04d9a49>] ext4_mb_load_buddy+0x9f/0x278

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
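
The guard amounts to something of this shape before ac_status is
(re)assigned in ext4_mb_check_limits() (sketch only):

	/* don't clobber an allocation that has already succeeded */
	if (ac->ac_status == AC_STATUS_FOUND)
		return;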


# fde4d95a 05-Jan-2009 Theodore Ts'o <tytso@mit.edu>

ext4: remove extraneous newlines from calls to ext4_error() and ext4_warning()

This removes annoying blank syslog entries emitted by ext4_error() or
ext4_warning(), since these functions add their own newline.

Signed-off-by: Nick Warne <nick@ukfsn.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0390131b 06-Jan-2009 Frank Mayhar <fmayhar@google.com>

ext4: Allow ext4 to run without a journal

A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal. Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs. We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal. It just so happens that running without a journal is
slightly faster for most everything.

We did
iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache. Results:

                     ext2        ext4, default   ext4, no journal
initial writes   13.0 MB/s        15.4 MB/s         15.7 MB/s
rewrites         13.1 MB/s        15.6 MB/s         15.9 MB/s
reads            15.2 MB/s        16.9 MB/s         17.2 MB/s
re-reads         15.3 MB/s        16.9 MB/s         17.2 MB/s
random readers    5.6 MB/s         5.6 MB/s          5.7 MB/s
random writers    5.1 MB/s         5.3 MB/s          5.4 MB/s

So it seems that, so far, this was a useful exercise.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ff7ef329 16-Dec-2008 Yasunori Goto <y-goto@jp.fujitsu.com>

ext4: Widen type of ext4_sb_info.s_mb_maxs[]

I chased the cause of following ext4 oops report which is tested on
ia64 box.

http://bugzilla.kernel.org/show_bug.cgi?id=12018

The cause is that the s_mb_maxs array is defined as "unsigned
short" in the ext4_sb_info structure. If the file system's block size is
8k or greater, an unsigned short is not wide enough to contain the
value fs->blocksize << 3.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: stable@kernel.org
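
To make the arithmetic concrete: with an 8k block size,
fs->blocksize << 3 is 8192 * 8 = 65536, one more than USHRT_MAX
(65535), so the value is truncated when stored in an unsigned short.
The widening is then simply (sketch of the member in ext4_sb_info):

	unsigned int *s_mb_maxs;	/* was: unsigned short *s_mb_maxs */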


# ae2d9fb1 04-Nov-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: fix missing ext4_unlock_group in error path

If we try to free a block which is already freed, the code was
returning without first unlocking the group.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3e624fc7 16-Oct-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback

The multiblock allocator needs to be able to release blocks (and issue
a blkdev discard request) when the transaction which freed those
blocks is committed. Previously this was done via a polling mechanism
when blocks were allocated or freed. A much better way of doing things
is to create a jbd2 callback function and attach the list of blocks
to be freed directly to the transaction structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 0b09923e 17-Oct-2008 Manish Katiyar <mkatiyar@gmail.com>

ext4: Remove compile warnings when building w/o CONFIG_PROC_FS

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 8a0aba73 16-Oct-2008 Theodore Ts'o <tytso@mit.edu>

ext4: let the block device know when unused blocks can be discarded

Let the block device know when unused blocks can be discarded, using
the new sb_issue_discard() interface.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c894058d 16-Oct-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Use an rbtree for tracking blocks freed during transaction.

With this patch we track the blocks freed during a transaction using a
red-black tree. We also make sure contiguous blocks freed are collected
in one node in the tree.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 688f05a0 12-Oct-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Free ext4_prealloc_space using kmem_cache_free

We should use kmem_cache_free to free memory allocated
via kmem_cache_alloc

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# c806e68f 10-Oct-2008 Frederic Bohe <frederic.bohe@bull.net>

ext4: fix initialization of UNINIT bitmap blocks

This fixes a bug which caused on-line resizing of filesystems with a
1k blocksize to fail. The root cause of this bug was the fact that if
an uninitialized bitmap block gets read in by userspace (which
e2fsprogs does try to avoid, but can happen when the blocksize is less
than the pagesize and an adjacent block is read into memory)
ext4_read_block_bitmap() was erroneously depending on the buffer
uptodate flag to decide whether it needed to initialize the bitmap
block in memory --- i.e., to set the standard set of blocks in use by
a block group (superblock, bitmaps, inode table, etc.). Essentially,
ext4_read_block_bitmap() assumed it was the only routine that might
try to read a block containing a block bitmap, which is simply not
true.

To fix this, ext4_read_block_bitmap() and ext4_read_inode_bitmap()
must always initialize uninitialized bitmap blocks. Once a block or
inode is allocated out of that bitmap, it will be marked as
initialized in the block group descriptor, so in general this won't
result in any extra unnecessary work.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c2ea3fde 10-Oct-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Remove old legacy block allocator

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5e8814f2 23-Sep-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Combine proc file handling into a single set of functions

Previously mballoc created a separate set of functions for each proc
file. This combines the tunables into a single set of functions which
gets used for all of the per-superblock proc files, saving
approximately 2k of compiled object code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 9f6200bb 23-Sep-2008 Theodore Ts'o <tytso@mit.edu>

ext4: move /proc setup and teardown out of mballoc.c

...and into the core setup/teardown code in fs/ext4/super.c so that
other parts of ext4 can define tuning parameters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 730c213c 13-Sep-2008 Eric Sandeen <sandeen@redhat.com>

ext4: use percpu data structures for lg_prealloc_list

lg_prealloc_list seems to cry out for a per-cpu data structure; on a large
smp system I think this should be better. I've lightly tested this change
on a 4-cpu system.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 899fc1a4 14-Sep-2008 Alexey Dobriyan <adobriyan@gmail.com>

ext4: fix #11321: create /proc/ext4/*/stats more carefully

ext4 creates a per-superblock directory in /proc/ext4/. The name used as
the basis is taken from bdevname, which, surprise, can contain a slash.

However, while proc allows the proc_create("a/b", parent) form of
PDE creation, it assumes that parent/a was already created.

The bdevname in question is 'cciss/c0d0p9', the directory is not created,
and all this stuff goes directly into /proc (which is a real bug).

The warning comes when the _second_ partition is mounted.

http://bugzilla.kernel.org/show_bug.cgi?id=11321

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6bc6e63f 10-Oct-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Add percpu dirty block accounting.

This patch adds dirty block accounting using percpu_counters. Delayed
allocation block reservation is now done by updating the dirty block
counter. In a later patch we switch to non-delalloc mode if the
filesystem's free block count is greater than 150% of the total
filesystem dirty blocks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao<cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 030ba6bc 08-Sep-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Retry block reservation

During block reservation, if we don't have enough blocks left, retry
block reservation with smaller block counts. This makes sure we try
fallocate and DIO with smaller request sizes and don't fail early. The
delayed allocation reservation cannot retry with a smaller block count,
so retry block reservation to handle temporary disk-full conditions. Also
print free block details if we fail block allocation during writepages.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# a30d542a 09-Oct-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Make sure all the block allocation paths reserve blocks

With delayed allocation we need to make sure block are reserved before
we attempt to allocate them. Otherwise we get block allocation failure
(ENOSPC) during writepages which cannot be handled. This would mean
silent data loss (we do a printk stating data will be lost). This patch
updates the DIO and fallocate code paths to do block reservation before
block allocation. This is needed to make sure parallel DIO and fallocate
requests don't take blocks out of the delayed reserve space.

When the free blocks count goes below a threshold we switch to a slow
path which looks at other CPUs' accumulated percpu counter values.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4776004f 08-Sep-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Add printk priority levels to clean up checkpatch warnings

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 5e745b04 18-Aug-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix small file fragmentation

For small file block allocations, mballoc uses per cpu prealloc
space. Use the goal block when searching for the right prealloc
space. Also make sure ext4_da_writepages tries to write
all the pages for small files in a single attempt.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 6be2ded1 23-Jul-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Don't allow lg prealloc list to be grow large.

Currently, the locality group prealloc list is freed only when there
is a block allocation failure. This can result in large number of
entries in the preallocation list making ext4_mb_use_preallocated()
expensive.

To fix this, we convert the locality group prealloc list to a hash
list. The hash index is the order of number of blocks in the prealloc
space with a max order of 9. When adding prealloc space to the list we
make sure the total number of entries for each order does not exceed 8.
If it is more than 8, we discard a few entries and make sure that we
are left with only <= 5 entries.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
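
A sketch of the hashing step (PREALLOC_TB_SIZE is the table-size
constant from mballoc.h; trimming of excess entries is omitted):

	/* index by the order of the PA's free block count, capped at 9 */
	order = fls(pa->pa_free) - 1;
	if (order > PREALLOC_TB_SIZE - 1)
		order = PREALLOC_TB_SIZE - 1;

	/* add under the locality group's list protection */
	list_add_tail_rcu(&pa->pa_inode_list, &lg->lg_prealloc_list[order]);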


# 1320cbcf 23-Jul-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Convert the usage of NR_CPUS to nr_cpu_ids.

NR_CPUS can be really large. We should be using nr_cpu_ids instead.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ce89f46c 23-Jul-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Improve error handling in mballoc

Don't call BUG_ON on file system failures. Instead use ext4_error and
also handle the continue case properly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# b5f10eed 02-Aug-2008 Eric Sandeen <sandeen@redhat.com>

ext4: lock block groups when initializing

I noticed when filling a 1T filesystem with 4 threads using the
fs_mark benchmark:

fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0

that I occasionally got checksum mismatch errors:

EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935

etc. I'd reliably get 4-5 of them during the run.

It appears that the problem is likely a race to init the bg's
when the uninit_bg feature is enabled.

With the patch below, which adds sb_bgl_locking around initialization,
I was able to complete several runs with no errors or warnings.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# d2a17637 14-Jul-2008 Mingming Cao <cmm@us.ibm.com>

ext4: delayed allocation ENOSPC handling

This patch does block reservation for delayed
allocation, to avoid ENOSPC later at page flush time.

Blocks (data and metadata) are reserved at da_write_begin()
time, the free blocks counter is updated by then, and the number of
reserved blocks is stored in a per-inode counter.

At the writepage time, the unused reserved meta blocks are returned
back. At unlink/truncate time, reserved blocks are properly released.

Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix the old allocator's block reservation accounting with delalloc;
added a lock to guard the counters and also fixed the reservation for
meta blocks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 5f21b0e6 11-Jul-2008 Frederic Bohe <frederic.bohe@bull.net>

ext4: fix online resize with mballoc

Update group infos when updating a group's descriptor.
Add group infos when adding a group's descriptor.
Refresh cache pages used by mb_alloc when changes occur.
This will probably need modifications when META_BG resizing is allowed.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>


# 07031431 11-Jul-2008 Mingming Cao <cmm@us.ibm.com>

ext4: mballoc avoid use root reserved blocks for non root allocation

mballoc allocation missed the check for blocks reserved for root users. Add
an ext4_has_free_blocks() check before allocation. Also modified
ext4_has_free_blocks() to support multiple block allocation requests.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 654b4908 11-Jul-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: cleanup block allocator

Move the code related to block allocation to a single function and add helper
functions to differentiate allocation of data and metadata blocks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 363d4251 11-Jul-2008 Shen Feng <shen@cn.fujitsu.com>

ext4: remove quota allocation when ext4_mb_new_blocks fails

Quota allocation was not removed when the kmem_cache_alloc call in
ext4_mb_new_blocks failed. Also make sure the allocation context is freed
on the error path.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 772cb7c8 11-Jul-2008 Jose R. Santos <jrs@us.ibm.com>

ext4: New inode allocation for FLEX_BG meta-data groups.

This patch mostly controls the way inodes are allocated in order to
make ialloc aware of flex_bg block group grouping. It achieves this
by bypassing the Orlov allocator when block group meta-data are packed
together through mke2fs. Since the impact on the block allocator is
minimal, this patch should have little or no effect on other block
allocation algorithms. By controlling the inode allocation, it can
basically control where the initial search for new blocks begins and
thus indirectly manipulate the block allocator.

This allocator favors data and meta-data locality so the disk will
gradually be filled from block group zero upward. This helps improve
performance by reducing seek time. Since the group of inode tables
within one flex_bg is treated as one giant inode table, uninitialized
block groups would not need to partially initialize as many inode
tables as with Orlov, which helps fsck time as the filesystem usage
goes up.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 7e5a8cdd 13-Jul-2008 Shen Feng <shen@cn.fujitsu.com>

ext4: fix error processing in mb_free_blocks

Processing the return value of mb_free_blocks is meaningless because it
only returns 0. This fix includes:

- make mb_free_blocks return void

- remove the error processing part in callers

- unlock group before calling ext4_error in mb_free_blocks

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# cfbe7e4f 13-Jul-2008 Shen Feng <shen@cn.fujitsu.com>

ext4: skip proc entry creation when fs/ext4 is not correctly created

When the fs/ext4 directory is not successfully created under proc, the
entries under this directory should not be created.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 574ca174 11-Jul-2008 Theodore Ts'o <tytso@mit.edu>

ext4: Rename read_block_bitmap() to ext4_read_block_bitmap()

Since this is a non-static function, give it an ext4-specific name to avoid
conflicts with other filesystems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 74767c5a 11-Jul-2008 Shen Feng <shen@cn.fujitsu.com>

ext4: miscellaneous error checks and coding cleanups for mballoc

ext4_mb_seq_history_open(): check if sbi->s_mb_history is NULL

ext4_mb_history_init(): replace kmalloc and memset with kzalloc

ext4_mb_init_backend(): remove memset since kzalloc is used

ext4_mb_init(): the return value of ext4_mb_init_backend is int,
but i is unsigned; replace it with a new int variable.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# fdf6c7a7 11-Jul-2008 Shen Feng <shen@cn.fujitsu.com>

ext4: add error processing when calling ext4_mb_init_cache in mballoc

Add error processing for ext4_mb_load_buddy when it calls
ext4_mb_init_cache.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 31b481dc 11-Jul-2008 Mingming Cao <cmm@us.ibm.com>

ext4: Fix ext4_mb_init_cache return error

ext4_mb_init_cache() incorrectly always returns EIO on success. This
causes the caller of ext4_mb_init_cache() to fail when it checks the
return value.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 91d99827 11-Jul-2008 Alexey Dobriyan <adobriyan@gmail.com>

ext4: switch to seq_files

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ed8f9c75 11-Jul-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: start searching for the right extent from the goal group.

With mballoc we search for the best extent using different
criteria. We should always use the goal group when we are
starting with a new criterion.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e7dfb246 11-Jul-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix mb_find_next_bit not to return larger than max

Some architectures implement ext4_find_next_bit and
ext4_find_next_zero_bit in such a way that they return a value
greater than max for some input values. Make sure
mb_find_next_bit and mb_find_next_zero_bit return the
right values.

On 2.6.25 we have include/asm-x86/bitops_32.h
static inline unsigned find_first_bit(const unsigned long *addr, unsigned size)
{
        unsigned x = 0;

        while (x < size) {
                unsigned long val = *addr++;
                if (val)
                        return __ffs(val) + x;
                x += (sizeof(*addr)<<3);
        }
        return x;
}

This can return a value greater than size.

Reported and fixed here for lustre

https://bugzilla.lustre.org/show_bug.cgi?id=15932
https://bugzilla.lustre.org/attachment.cgi?id=17205
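
The shape of the fix, as a hedged sketch focusing only on the clamp
(the wrapper's bitmap-address handling is omitted; ext4_find_next_bit
is the arch-provided helper mentioned above):

static int mb_find_next_bit(void *addr, int max, int start)
{
        int ret = ext4_find_next_bit(addr, max, start);

        /* Arch helpers may overshoot; never report a bit beyond max. */
        if (ret > max)
                return max;
        return ret;
}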

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 03cddb80 05-Jun-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix use of uninitialized data with debug enabled.

Fix use of uninitialized data with debug enabled.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 519deca0 15-May-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Retry block allocation if new blocks are allocated from system zone.

If the block allocator gets blocks out of the system zone, ext4 calls
ext4_error. But if the file system is mounted with errors=continue,
retry the block allocation. We need to mark the system zone blocks as
in use to make sure the retry doesn't pick them again.

The system zone is the block range covering the block bitmap, inode bitmap,
and inode table.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 1930479c 13-May-2008 Valerie Clement <valerie.clement@bull.net>

ext4: mballoc fix mb_normalize_request algorithm for 1KB block size filesystems

In case of inode preallocation, the number of blocks to allocate depends
on the file size and it is calculated in ext4_mb_normalize_request().
Each group in the filesystem is then checked to find one that can be
used for allocation; this is done in ext4_mb_good_group().

When a file bigger than 4MB is created, the requested number of blocks
to preallocate, calculated by ext4_mb_normalize_request is 4096.
However for a filesystem with 1KB block size, the maximum size of the
block buddies used by the multiblock allocator is 2048, so none of
groups in the filesystem satisfies the search criteria in
ext4_mb_good_group(). Scanning all the filesystem groups impacts
performance.

This was demonstrated by using a freshly created, 70GB, 1k block
filesystem, with caches dropped right before the test via
/proc/sys/vm/drop_caches, and with the filesystem mounted with
nodelalloc and nodelalloc,nomballoc. The time to write an 8 megabyte
file using "dd if=/dev/zero of=/mnt/test/fo bs=8k count=1k conv=fsync"
took 35.5091 seconds (236kB/s) with nodelalloc, and 0.233754 seconds
(35.9 MB/s) with the nodelalloc,nomballoc options. With a 1TB partition,
it took several minutes to write 8MB!

This patch modifies the algorithm in ext4_mb_normalize_group_request to
calculate the number of blocks to allocate by taking into account the
maximum size of the free-block chunks handled by the multiblock allocator.

It has also been tested for filesystems with 2KB and 4KB block sizes to
ensure that those cases don't regress.
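
As a hedged illustration of the clamp (assuming the largest buddy order is
blocksize_bits + 1, which matches the 2048-block maximum quoted above for
1KB blocks; the real ext4_mb_normalize_request() also applies its own
sizing heuristics):

static unsigned long clamp_to_max_buddy(unsigned long len,
                                        unsigned int blocksize_bits)
{
        /* Largest chunk a buddy bitmap can describe for this block size. */
        unsigned long max_chunk = 1UL << (blocksize_bits + 1);

        return len > max_chunk ? max_chunk : len;
}

With blocksize_bits = 10 (1KB blocks) this caps a normalized request at
2048 blocks, so ext4_mb_good_group() can always find a satisfiable group.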

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# f36f21ec 12-May-2008 Jean Delvare <khali@linux-fr.org>

Fix misuses of bdevname()

bdevname() fills the buffer that it is given as a parameter, so calling
strcpy() or snprintf() on the returned value is redundant (and probably not
guaranteed to work - I don't think strcpy and snprintf support overlapping
buffers.)
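
Concretely (sketch; bdev stands for whatever struct block_device pointer
is at hand):

char b[BDEVNAME_SIZE];

/* Redundant, and relies on copying between overlapping buffers: */
snprintf(b, BDEVNAME_SIZE, "%s", bdevname(bdev, b));

/* Sufficient: bdevname() already wrote the name into b and returns b. */
bdevname(bdev, b);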

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f1fa3342 29-Apr-2008 Roel Kluin <12o3l@tiscali.nl>

ext4: fix hot spins in mballoc after err_freebuddy and err_freemeta

In ext4_mb_init_backend() 'i' is of type ext4_group_t. Since it is unsigned,
i >= 0 is always true, so fix the hot spins after err_freebuddy: and
err_freemeta:, and prevent decrements past zero.
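
The problem in miniature (illustrative fragment, not the exact ext4 code;
ngroups and free_group_info() are placeholders for the real cleanup):

ext4_group_t i;

/* Broken: i is unsigned, so i >= 0 is always true and the loop spins. */
for (i = ngroups - 1; i >= 0; i--)
        free_group_info(i);

/* Fixed: test before decrementing; terminates and never wraps past zero. */
i = ngroups;
while (i-- > 0)
        free_group_info(i);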

Signed-off-by: Roel Kluin <12o3l@tiscali.nl>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 8f6e39a7 29-Apr-2008 Mingming Cao <cmm@us.ibm.com>

ext4: Move mballoc headers/structures to a seperate header file mballoc.h

Move function and structure definitions out of mballoc.c and put them in
a new header file, mballoc.h.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 60bd63d1 29-Apr-2008 Solofo Ramangalahy <Solofo.Ramangalahy@bull.net>

ext4: cleanup for compiling mballoc with verification and debugging #defines

This patch allows compiling mballoc with:
#define AGGRESSIVE_CHECK
#define DOUBLE_CHECK
#define MB_DEBUG

It fixes:
Compilation errors:
fs/ext4/mballoc.c: In function '__mb_check_buddy':
fs/ext4/mballoc.c:605: error: 'struct ext4_prealloc_space' has no member named 'group_list'
fs/ext4/mballoc.c:606: error: 'struct ext4_prealloc_space' has no member named 'pstart'
fs/ext4/mballoc.c:608: error: 'struct ext4_prealloc_space' has no member named 'len'

Compilation warnings:
fs/ext4/mballoc.c: In function 'ext4_mb_normalize_group_request':
fs/ext4/mballoc.c:2863: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'int'
fs/ext4/mballoc.c: In function 'ext4_mb_use_inode_pa':
fs/ext4/mballoc.c:3103: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'int'

Sparse check:
fs/ext4/mballoc.c:3818:2: warning: context imbalance in 'ext4_mb_show_ac' - different lock contexts for basic block

Signed-off-by: Solofo Ramangalahy <Solofo.Ramangalahy@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# c83617db 29-Apr-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Don't do GFP_NOFS allocations after taking ext4_lock_group

We can't do a GFP_NOFS allocation after taking ext4_lock_group, because the
allocation may sleep while the group spinlock is held:

BUG: sleeping function called from invalid context at mm/slab.c:3054
in_atomic():1, irqs_disabled():0
1 lock held by vi/2426:
#0: (&ei->i_data_sem){----}, at: [<c01cf665>] ext4_release_file+0x23/0x66
Pid: 2426, comm: vi Not tainted 2.6.25-rc7 #24
[<c011a3dc>] __might_sleep+0xbe/0xc5
[<c01620c9>] kmem_cache_alloc+0x22/0xa6
[<c01e382a>] ext4_mb_release_inode_pa+0x73/0x1b3
[<c01e6adf>] ext4_mb_discard_inode_preallocations+0x22d/0x2d4
[<c013000a>] ? param_set_ushort+0x32/0x39
[<c01ceba1>] ext4_discard_reservation+0x27/0x6a
[<c01cf66c>] ext4_release_file+0x2a/0x66
[<c0165bd6>] __fput+0xae/0x155
[<c0165e46>] fput+0x17/0x19
[<c0163756>] filp_close+0x50/0x5a
[<c01647c0>] sys_close+0x71/0xad
[<c0104aba>] sysenter_past_esp+0x5f/0xa5
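
One hedged sketch of the general remedy (not necessarily the exact change
made here; pa_cachep is a placeholder cache name): do the sleeping GFP_NOFS
allocation before taking the group spinlock, and only link the object in
while holding it.

struct ext4_prealloc_space *pa;

pa = kmem_cache_alloc(pa_cachep, GFP_NOFS);     /* may sleep: no lock held */
if (!pa)
        return -ENOMEM;

ext4_lock_group(sb, group);     /* spinlock: no sleeping allowed from here */
/* ... link pa into the group's structures ... */
ext4_unlock_group(sb, group);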

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 3dcf5451 29-Apr-2008 Christoph Hellwig <hch@lst.de>

ext4: move headers out of include/linux

Move ext4 headers out of include/linux. This is just the trivial move;
there are more things that could be done later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 46fe74f2 29-Apr-2008 Denis V. Lunev <den@openvz.org>

ext4: use non-racy method for proc entries creation

Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
are set up before gluing the PDE into the main tree.
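
For illustration (a sketch; the entry name, fops and parent directory are
placeholders, not necessarily the exact ones touched by this patch):

/* Racy: the entry becomes visible before proc_fops/data are filled in. */
pde = create_proc_entry("mb_history", S_IFREG | S_IRUGO, parent);
if (pde) {
        pde->proc_fops = &ext4_mb_seq_history_fops;
        pde->data = sb;
}

/* Non-racy: everything is set before the PDE is glued into the tree. */
proc_create_data("mb_history", S_IFREG | S_IRUGO, parent,
                 &ext4_mb_seq_history_fops, sb);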

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 36a5aeb8 29-Apr-2008 Alexey Dobriyan <adobriyan@gmail.com>

proc: remove proc_root_fs

Use creation by full path instead: "fs/foo".

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 46e665e9 17-Apr-2008 Harvey Harrison <harvey.harrison@gmail.com>

ext4: replace remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific; use __func__ instead.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 14499f35 17-Apr-2008 Mingming Cao <cmm@us.ibm.com>

ext4: remove extra define of ext4_new_blocks_old from mballoc.c

The function prototype of ext4_new_blocks_old() is defined in ext4_fs.h,
so we don't need the extra function prototype in mballoc.c

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e8546d06 17-Apr-2008 Marcin Slusarz <marcin.slusarz@gmail.com>

ext4: le*_add_cpu conversion

Replace all instances of:

        little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) +
                                            expression_in_cpu_byteorder);

with:

        leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder);

Generated with a semantic patch.
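
Concretely, for any little-endian on-disk counter (generic example, not a
specific ext4 field):

__le32 counter;         /* some on-disk little-endian field */
u32 delta = 8;

/* Before: explicit convert, add, convert back */
counter = cpu_to_le32(le32_to_cpu(counter) + delta);

/* After: the helper does the same in one call */
le32_add_cpu(&counter, delta);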

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Cc: sct@redhat.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: adilger@clusterfs.com
Cc: Mingming Cao <cmm@us.ibm.com>


# 9a0762c5 17-Apr-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Convert list_for_each_rcu() to list_for_each_entry_rcu()

The list_for_each_entry_rcu() primitive should be used instead of
list_for_each_rcu(), as the former is easier to use and provides
better type safety.

http://groups.google.com/group/linux.kernel/browse_thread/thread/45749c83451cebeb/0633a65759ce7713?lnk=raot
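
A generic sketch of the typed iterator (not ext4-specific; struct item,
head and use() are placeholders):

struct item {
        int val;
        struct list_head list;
};

struct item *it;

rcu_read_lock();
list_for_each_entry_rcu(it, &head, list) {
        /* 'it' is already a struct item *; no container_of() step needed */
        use(it->val);
}
rcu_read_unlock();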

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Roel Kluin <12o3l@tiscali.nl>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 4ddfef7b 29-Apr-2008 Eric Sandeen <sandeen@redhat.com>

ext4: reduce mballoc stack usage with noinline_for_stack

mballoc.c is a whole lot of static functions, which gcc seems to
really like to inline.

With the changes below, on x86, I can at least get from:

432 ext4_mb_new_blocks
240 ext4_mb_free_blocks
208 ext4_mb_discard_group_preallocations
188 ext4_mb_seq_groups_show
164 ext4_mb_init_cache
152 ext4_mb_release_inode_pa
136 ext4_mb_seq_history_show
...

to

220 ext4_mb_free_blocks
188 ext4_mb_seq_groups_show
176 ext4_mb_regular_allocator
164 ext4_mb_init_cache
156 ext4_mb_new_blocks
152 ext4_mb_release_inode_pa
136 ext4_mb_seq_history_show
124 ext4_mb_release_group_pa
...

which still has some big functions in there, but not 432 bytes!
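
The annotation itself is a one-word change per helper; a sketch
(illustrative signature, body elided):

/* Keep gcc from inlining this into its already-large callers, so its
 * locals do not land on the caller's stack frame. */
static noinline_for_stack int
ext4_mb_init_cache(struct page *page, char *incore)
{
        /* ... */
        return 0;
}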

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# ffad0a44 22-Feb-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: ext4_find_next_zero_bit needs an aligned address on some arch

ext4_find_next_zero_bit and ext4_find_next_bit need a long-aligned
address on x86_64. Add mb_find_next_zero_bit and mb_find_next_bit
and use them in mballoc.

Fix: https://bugzilla.redhat.com/show_bug.cgi?id=433286

Eric Sandeen debugged the problem and suggested the fix.
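
A hedged sketch of the wrapper idea: round the pointer down to a long
boundary and fold the skipped bits into the search offset (the real
mb_find_next_zero_bit() may differ in detail):

static int mb_find_next_zero_bit(void *addr, int max, int start)
{
        unsigned long off = (unsigned long)addr & (sizeof(unsigned long) - 1);
        void *base = (char *)addr - off;        /* long-aligned base address */
        int fix = off << 3;                     /* skipped bytes, in bits */
        int ret = ext4_find_next_zero_bit(base, max + fix, start + fix) - fix;

        return ret > max ? max : ret;
}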

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# e56eb659 15-Feb-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Don't claim block from group which has corrupt bitmap

In ext4_mb_complex_scan_group, if the extent length of the newly
found extent is greater than the total free blocks counted in the
group info, break without claiming the block.

Document the different ext4_error usages, explaining the state with which
we continue if we mount with errors=continue.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# b73fce69 15-Feb-2008 Valerie Clement <valerie.clement@bull.net>

ext4: Fix kernel BUG at fs/ext4/mballoc.c:910!

With the flex_bg feature enabled, creating a large file oopses the
kernel. The BUG_ON is:
BUG_ON(len >= EXT4_BLOCKS_PER_GROUP(sb));

Since with flex_bg the bitmaps and the inode table can be allocated
outside the block group, it is possible to allocate up to
EXT4_BLOCKS_PER_GROUP blocks in a group.

This patch fixes the oops.

Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 26346ff6 09-Feb-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Don't panic in case of corrupt bitmap

The multiblock allocator calls BUG_ON in many cases if the free and used
block counts obtained by looking at the bitmap differ from what the
allocator internally accounted for. Use ext4_error in such cases and
don't panic the system.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 256bdb49 09-Feb-2008 Eric Sandeen <sandeen@redhat.com>

ext4: allocate struct ext4_allocation_context from a kmem cache

struct ext4_allocation_context is rather large, and this bloats
the stack of many functions which use it. Allocating it from
a named slab cache will alleviate this.

For example, with this change (on top of the noinline patch sent earlier):

-ext4_mb_new_blocks 200
+ext4_mb_new_blocks 40

-ext4_mb_free_blocks 344
+ext4_mb_free_blocks 168

-ext4_mb_release_inode_pa 216
+ext4_mb_release_inode_pa 40

-ext4_mb_release_group_pa 192
+ext4_mb_release_group_pa 24

Most of these stack-allocated structs are actually used only for
mballoc history; and in those cases often a smaller struct would do.
So changing that may be another way around it, at least for those
functions, if preferred. For now, in those cases where the ac
is only for history, an allocation failure simply skips the history
recording, and does not cause any other failures.
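
The shape of the change, as a hedged sketch (cache name and flags are
illustrative):

static struct kmem_cache *ext4_ac_cachep;

/* Once, at mballoc init time */
ext4_ac_cachep = kmem_cache_create("ext4_alloc_context",
                                   sizeof(struct ext4_allocation_context),
                                   0, SLAB_RECLAIM_ACCOUNT, NULL);

/* Per allocation, instead of a few hundred bytes on the stack */
struct ext4_allocation_context *ac;

ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
if (!ac)
        return -ENOMEM; /* or, for history-only users, just skip recording */
/* ... */
kmem_cache_free(ext4_ac_cachep, ac);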


Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>


# 42a10add 09-Feb-2008 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext4: Fix null bh pointer dereference in mballoc

Reported by Adrian Bunk <bunk@kernel.org>:

The Coverity checker spotted the following NULL dereference:

static int ext4_mb_mark_diskspace_used
{
        ...
        if (!bitmap_bh)
                goto out_err;
        ...
out_err:
        sb->s_dirt = 1;
        put_bh(bitmap_bh);
        ...

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>


# c9de560d 28-Jan-2008 Alex Tomas <alex@clusterfs.com>

ext4: Add multi block allocator for ext4

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>