#
0efcd739 |
|
26-Feb-2024 |
Wenchao Hao <haowenchao2@huawei.com> |
ext4: remove unused parameter biop in ext4_issue_discard() all caller of ext4_issue_discard() would set biop to NULL since 'commit 55cdd0af2bc5 ("ext4: get discard out of jbd2 commit kthread contex")', it's unnecessary to keep this parameter any more. Signed-off-by: Wenchao Hao <haowenchao2@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240226081731.3224470-1-haowenchao2@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
fa606293 |
|
13-Feb-2024 |
Jan Kara <jack@suse.cz> |
ext4: don't report EOPNOTSUPP errors from discard When ext4 is mounted without journal, with discard mount option, and on a device not supporting trim, we print error for each and every freed extent. This is not only useless but actively harmful. Instead ignore the EOPNOTSUPP error. Trim is only advisory anyway and when the filesystem has journal we silently ignore trim error as well. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240213101601.17463-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4fbf8bc7 |
|
01-Feb-2024 |
Baokun Li <libaokun1@huawei.com> |
ext4: correct best extent lstart adjustment logic When yangerkun review commit 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()"), it was found that the best extent did not completely cover the original request after adjusting the best extent lstart in ext4_mb_new_inode_pa() as follows: original request: 2/10(8) normalized request: 0/64(64) best extent: 0/9(9) When we check if best ex can be kept at start of goal, ac_o_ex.fe_logical is 2 less than the adjusted best extent logical end 9, so we think the adjustment is done. But obviously 0/9(9) doesn't cover 2/10(8), so we should determine here if the original request logical end is less than or equal to the adjusted best extent logical end. In addition, add a comment stating when adjusted best_ex will not cover the original request, and remove the duplicate assertion because adjusting lstart makes no change to b_ex.fe_len. Link: https://lore.kernel.org/r/3630fa7f-b432-7afd-5f79-781bc3b2c5ea@huawei.com Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") Cc: <stable@kernel.org> Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20240201141845.1879253-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
68ee261f |
|
18-Jan-2024 |
Zhang Yi <yi.zhang@huawei.com> |
ext4: add a hint for block bitmap corrupt state in mb_groups If one group is marked as block bitmap corrupted, its free blocks cannot be used and its free count is also deducted from the global sbi->s_freeclusters_counter. User might be confused about the absent free space because we can't query the information about corrupted block groups except unreliable error messages in syslog. So add a hint to show block bitmap corrupted groups in mb_groups. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240119061154.1525781-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4b55d343 |
|
17-Jan-2024 |
yangerkun <yangerkun@huawei.com> |
ext4: improve error msg for ext4_mb_seq_groups_show While cat mb_groups for a mounted ext4 image which has some corrupted group, the string return to userspace was just "I/O error" which confuse me a lot. Improve it with ext4_decode_error. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240118042557.380058-2-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
25044880 |
|
17-Jan-2024 |
yangerkun <yangerkun@huawei.com> |
ext4: remove unused buddy_loaded in ext4_mb_seq_groups_show We can just first call ext4_mb_unload_buddy, then copy information from ext4_group_info. So remove this unused value. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240118042557.380058-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
f0e54b60 |
|
05-Jan-2024 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove 'needed' in trace_ext4_discard_preallocations As 'needed' to trace_ext4_discard_preallocations is always 0 which is meaningless. Just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
2ffd2a6a |
|
05-Jan-2024 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations The "needed" controls the number of ext4_prealloc_space to discard in ext4_discard_preallocations. Function ext4_discard_preallocations is supposed to discard all non-used preallocated blocks when "needed" is 0 and now ext4_discard_preallocations is always called with "needed" = 0. Remove unnecessary parameter "needed" and remove all non-used preallocated spaces in ext4_discard_preallocations to simplify the code. Note: If count of non-used preallocated spaces could be more than UINT_MAX, there was a memory leak as some non-used preallocated spaces are left ununsed and this commit will fix it. Otherwise, there is no behavior change. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
20427949 |
|
05-Jan-2024 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unused return value of ext4_mb_release_group_pa Remove unused return value of ext4_mb_release_group_pa. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
820c2808 |
|
05-Jan-2024 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unused return value of ext4_mb_release_inode_pa Remove unused return value of ext4_mb_release_inode_pa Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
90817717 |
|
05-Jan-2024 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unused return value of ext4_mb_release Remove unused return value of ext4_mb_release. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
11fd1a5d |
|
05-Jan-2024 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unneeded return value of ext4_mb_release_context Function ext4_mb_release_context always return 0 and the return value is never used. Just remove unneeded return value of ext4_mb_release_context. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
438a35e7 |
|
05-Jan-2024 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*() Remove unused parameter ngroup in ext4_mb_choose_next_group_*(). Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20240105092102.496631-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
133de5a0 |
|
05-Jan-2024 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unused return value of __mb_check_buddy Remove unused return value of __mb_check_buddy. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
c5f3a382 |
|
04-Jan-2024 |
Baokun Li <libaokun1@huawei.com> |
ext4: mark the group block bitmap as corrupted before reporting an error Otherwise unlocking the group in ext4_grp_locked_error may allow other processes to modify the core block bitmap that is known to be corrupt. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
83269837 |
|
04-Jan-2024 |
Baokun Li <libaokun1@huawei.com> |
ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal() Places the logic for checking if the group's block bitmap is corrupt under the protection of the group lock to avoid allocating blocks from the group with a corrupted block bitmap. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4530b366 |
|
04-Jan-2024 |
Baokun Li <libaokun1@huawei.com> |
ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found() Determine if the group block bitmap is corrupted before using ac_b_ex in ext4_mb_try_best_found() to avoid allocating blocks from a group with a corrupted block bitmap in the following concurrency and making the situation worse. ext4_mb_regular_allocator ext4_lock_group(sb, group) ext4_mb_good_group // check if the group bbitmap is corrupted ext4_mb_complex_scan_group // Scan group gets ac_b_ex but doesn't use it ext4_unlock_group(sb, group) ext4_mark_group_bitmap_corrupted(group) // The block bitmap was corrupted during // the group unlock gap. ext4_mb_try_best_found ext4_lock_group(ac->ac_sb, group) ext4_mb_use_best_found mb_mark_used // Allocating blocks in block bitmap corrupted group Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
993bf0f4 |
|
04-Jan-2024 |
Baokun Li <libaokun1@huawei.com> |
ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt Determine if bb_fragments is 0 instead of determining bb_free to eliminate the risk of dividing by zero when the block bitmap is corrupted. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-6-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
2331fd4a |
|
04-Jan-2024 |
Baokun Li <libaokun1@huawei.com> |
ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks() After updating bb_free in mb_free_blocks, it is possible to return without updating bb_fragments because the block being freed is found to have already been freed, which leads to inconsistency between bb_free and bb_fragments. Since the group may be unlocked in ext4_grp_locked_error(), this can lead to problems such as dividing by zero when calculating the average fragment length. Hence move the update of bb_free to after the block double-free check guarantees that the corresponding statistics are updated only after the core block bitmap is modified. Fixes: eabe0444df90 ("ext4: speed-up releasing blocks on commit") CC: <stable@vger.kernel.org> # 3.10 Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-5-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
c9b528c3 |
|
04-Jan-2024 |
Baokun Li <libaokun1@huawei.com> |
ext4: regenerate buddy after block freeing failed if under fc replay This mostly reverts commit 6bd97bf273bd ("ext4: remove redundant mb_regenerate_buddy()") and reintroduces mb_regenerate_buddy(). Based on code in mb_free_blocks(), fast commit replay can end up marking as free blocks that are already marked as such. This causes corruption of the buddy bitmap so we need to regenerate it in that case. Reported-by: Jan Kara <jack@suse.cz> Fixes: 6bd97bf273bd ("ext4: remove redundant mb_regenerate_buddy()") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
17220215 |
|
04-Jan-2024 |
Baokun Li <libaokun1@huawei.com> |
ext4: do not trim the group with corrupted block bitmap Otherwise operating on an incorrupted block bitmap can lead to all sorts of unknown problems. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
68da4c44 |
|
15-Dec-2023 |
Ye Bin <yebin10@huawei.com> |
ext4: fix inconsistent between segment fstrim and full fstrim Suppose we issue two FITRIM ioctls for ranges [0,15] and [16,31] with mininum length of trimmed range set to 8 blocks. If we have say a range of blocks 10-22 free, this range will not be trimmed because it straddles the boundary of the two FITRIM ranges and neither part is big enough. This is a bit surprising to some users that call FITRIM on smaller ranges of blocks to limit impact on the system. Also XFS trims all free space extents that overlap with the specified range so we are inconsistent among filesystems. Let's change ext4_try_to_trim_range() to consider for trimming the whole free space extent that straddles the end of specified range, not just the part of it within the range. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231216010919.1995851-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1f6bc02f |
|
15-Dec-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: fallback to complex scan if aligned scan doesn't work Currently in case the goal length is a multiple of stripe size we use ext4_mb_scan_aligned() to find the stripe size aligned physical blocks. In case we are not able to find any, we again go back to calling ext4_mb_choose_next_group() to search for a different suitable block group. However, since the linear search always begins from the start, most of the times we end up with the same BG and the cycle continues. With large fliesystems, the CPU can be stuck in this loop for hours which can slow down the whole system. Hence, until we figure out a better way to continue the search (rather than starting from beginning) in ext4_mb_choose_next_group(), lets just fallback to ext4_mb_complex_scan_group() in case aligned scan fails, as it is much more likely to find the needed blocks. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/ee033f6dfa0a7f2934437008a909c3788233950f.1702455010.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
7c784d62 |
|
12-Dec-2023 |
Suraj Jitindar Singh <surajjs@amazon.com> |
ext4: allow for the last group to be marked as trimmed The ext4 filesystem tracks the trim status of blocks at the group level. When an entire group has been trimmed then it is marked as such and subsequent trim invocations with the same minimum trim size will not be attempted on that group unless it is marked as able to be trimmed again such as when a block is freed. Currently the last group can't be marked as trimmed due to incorrect logic in ext4_last_grp_cluster(). ext4_last_grp_cluster() is supposed to return the zero based index of the last cluster in a group. This is then used by ext4_try_to_trim_range() to determine if the trim operation spans the entire group and as such if the trim status of the group should be recorded. ext4_last_grp_cluster() takes a 0 based group index, thus the valid values for grp are 0..(ext4_get_groups_count - 1). Any group index less than (ext4_get_groups_count - 1) is not the last group and must have EXT4_CLUSTERS_PER_GROUP(sb) clusters. For the last group we need to calculate the number of clusters based on the number of blocks in the group. Finally subtract 1 from the number of clusters as zero based indexing is expected. Rearrange the function slightly to make it clear what we are calculating and returning. Reproducer: // Create file system where the last group has fewer blocks than // blocks per group $ mkfs.ext4 -b 4096 -g 8192 /dev/nvme0n1 8191 $ mount /dev/nvme0n1 /mnt Before Patch: $ fstrim -v /mnt /mnt: 25.9 MiB (27156480 bytes) trimmed // Group not marked as trimmed so second invocation still discards blocks $ fstrim -v /mnt /mnt: 25.9 MiB (27156480 bytes) trimmed After Patch: fstrim -v /mnt /mnt: 25.9 MiB (27156480 bytes) trimmed // Group marked as trimmed so second invocation DOESN'T discard any blocks fstrim -v /mnt /mnt: 0 B (0 bytes) trimmed Fixes: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Cc: <stable@vger.kernel.org> # 4.19+ Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231213051635.37731-1-surajjs@amazon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
2bf5eb2a |
|
13-Nov-2023 |
Gou Hao <gouhao@uniontech.com> |
ext4: improving calculation of 'fe_{len|start}' in mb_find_extent() After first execution of mb_find_order_for_block(): 'fe_start' is the value of 'block' passed in mb_find_extent(). 'fe_len' is the difference between the length of order-chunk and remainder of the block divided by order-chunk. And 'next' does not require initialization after above modifications. Signed-off-by: Gou Hao <gouhao@uniontech.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231113082617.11258-1-gouhao@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
f2fec3e9 |
|
23-Oct-2023 |
Gou Hao <gouhao@uniontech.com> |
ext4: delete redundant calculations in ext4_mb_get_buddy_page_lock() 'blocks_per_page' is always 1 after 'if (blocks_per_page >= 2)', 'pnum' and 'block' are equal in this case. Signed-off-by: Gou Hao <gouhao@uniontech.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231024035215.29474-1-gouhao@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
2dcf5fde |
|
26-Nov-2023 |
Baokun Li <libaokun1@huawei.com> |
ext4: prevent the normalized size from exceeding EXT_MAX_BLOCKS For files with logical blocks close to EXT_MAX_BLOCKS, the file size predicted in ext4_mb_normalize_request() may exceed EXT_MAX_BLOCKS. This can cause some blocks to be preallocated that will not be used. And after [Fixes], the following issue may be triggered: ========================================================= kernel BUG at fs/ext4/mballoc.c:4653! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 1 PID: 2357 Comm: xfs_io 6.7.0-rc2-00195-g0f5cc96c367f Hardware name: linux,dummy-virt (DT) pc : ext4_mb_use_inode_pa+0x148/0x208 lr : ext4_mb_use_inode_pa+0x98/0x208 Call trace: ext4_mb_use_inode_pa+0x148/0x208 ext4_mb_new_inode_pa+0x240/0x4a8 ext4_mb_use_best_found+0x1d4/0x208 ext4_mb_try_best_found+0xc8/0x110 ext4_mb_regular_allocator+0x11c/0xf48 ext4_mb_new_blocks+0x790/0xaa8 ext4_ext_map_blocks+0x7cc/0xd20 ext4_map_blocks+0x170/0x600 ext4_iomap_begin+0x1c0/0x348 ========================================================= Here is a calculation when adjusting ac_b_ex in ext4_mb_new_inode_pa(): ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len); if (ac->ac_o_ex.fe_logical >= ex.fe_logical) goto adjust_bex; The problem is that when orig_goal_end is subtracted from ac_b_ex.fe_len it is still greater than EXT_MAX_BLOCKS, which causes ex.fe_logical to overflow to a very small value, which ultimately triggers a BUG_ON in ext4_mb_new_inode_pa() because pa->pa_free < len. The last logical block of an actual write request does not exceed EXT_MAX_BLOCKS, so in ext4_mb_normalize_request() also avoids normalizing the last logical block to exceed EXT_MAX_BLOCKS to avoid the above issue. The test case in [Link] can reproduce the above issue with 64k block size. Link: https://patchwork.kernel.org/project/fstests/list/?series=804003 Cc: <stable@kernel.org> # 6.4 Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231127063313.3734294-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
7c9fa399 |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: add first unit test for ext4_mb_new_blocks_simple in mballoc Here are prepared work: 1. Include mballoc-test.c to mballoc.c to be able test static function in mballoc.c. 2. Implement static stub to avoid read IO to disk. 3. Construct fake super_block. Only partial members are set, more members will be set when more functions are tested. Then unit test for ext4_mb_new_blocks_simple is added. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-12-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bdefd689 |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: add some kunit stub for mballoc kunit test Multiblocks allocation will read and write block bitmap and group descriptor which reside on disk. Add kunit stub to function ext4_get_group_desc, ext4_read_block_bitmap_nowait, ext4_wait_block_bitmap and ext4_mb_mark_context to avoid real IO to disk. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-11-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
5c657db4 |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: call ext4_mb_mark_context in ext4_group_add_blocks() Call ext4_mb_mark_context in ext4_group_add_blocks() to remove repeat code. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
03c7fc39 |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: Separate block bitmap and buddy bitmap freeing in ext4_group_add_blocks() This patch separates block bitmap and buddy bitmap freeing in order to update block bitmap with ext4_mb_mark_context in following patch. The reason why this can be sperated is explained in previous submit. Put the explanation here to simplify the code archeology to ext4_group_add_blocks(): Separated freeing is safe with concurrent allocation as long as: 1. Firstly allocate block in buddy bitmap, and then in block bitmap. 2. Firstly free block in block bitmap, and then buddy bitmap. Then freed block will only be available to allocation when both buddy bitmap and block bitmap are updated by freeing. Allocation obeys rule 1 already, just do sperated freeing with rule 2. Separated freeing has no race with generate_buddy as: Once ext4_mb_load_buddy_gfp is executed successfully, the update-to-date buddy page can be found in sbi->s_buddy_cache and no more buddy initialization of the buddy page will be executed concurrently until buddy page is unloaded. As we always do free in "load buddy, free, unload buddy" sequence, separated freeing has no race with generate_buddy. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
38b8f70c |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: call ext4_mb_mark_context in ext4_mb_clear_bb Call ext4_mb_mark_context in ext4_mb_clear_bb to remove repeat code. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
33e728c6 |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: Separate block bitmap and buddy bitmap freeing in ext4_mb_clear_bb() This patch separates block bitmap and buddy bitmap freeing in order to update block bitmap with ext4_mb_mark_context in following patch. Separated freeing is safe with concurrent allocation as long as: 1. Firstly allocate block in buddy bitmap, and then in block bitmap. 2. Firstly free block in block bitmap, and then buddy bitmap. Then freed block will only be available to allocation when both buddy bitmap and block bitmap are updated by freeing. Allocation obeys rule 1 already, just do sperated freeing with rule 2. Separated freeing has no race with generate_buddy as: Once ext4_mb_load_buddy_gfp is executed successfully, the update-to-date buddy page can be found in sbi->s_buddy_cache and no more buddy initialization of the buddy page will be executed concurrently until buddy page is unloaded. As we always do free in "load buddy, free, unload buddy" sequence, separated freeing has no race with generate_buddy. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
2f94711b |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: call ext4_mb_mark_context in ext4_mb_mark_diskspace_used Call ext4_mb_mark_context in ext4_mb_mark_diskspace_used to: 1. Remove repeat code to normally update bitmap and group descriptor on disk. 2. Now that we have a common API for marking blocks inuse/free in block bitmap, use that instead of open coding it in function ext4_mb_mark_diskspace_used(). The current code was not updating checksum and other counters. ext4_mb_mark_context() should fix these consistency problems. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
c431d386 |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: extend ext4_mb_mark_context to support allocation under journal Previously, ext4_mb_mark_context is only called under fast commit replay path, so there is no valid handle when we update block bitmap and group descriptor. This patch try to extend ext4_mb_mark_context to be used by code under journal. There are several improvement: 1. Add "handle_t *handle" to struct ext4_mark_context to journal block bitmap and group descriptor update inside ext4_mb_mark_context (the added journal code is based on ext4_mb_mark_diskspace_used where ext4_mb_mark_context is going to be used.) 2. Adds a flag argument to ext4_mb_mark_context() which controls a. EXT4_MB_BITMAP_MARKED_CHECK - whether block bitmap checking is needed. b. EXT4_MB_SYNC_UPDATE - whether dirty buffers (bitmap and group descriptor) needs sync. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
26d0f87b |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: call ext4_mb_mark_context in ext4_free_blocks_simple call ext4_mb_mark_context in ext4_free_blocks_simple to: 1. remove repeat code 2. pair update of free_clusters in ext4_mb_new_blocks_simple. 3. add missing ext4_lock_group/ext4_unlock_group protection. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
f9e2d95a |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: factor out codes to update block bitmap and group descriptor on disk from ext4_mb_mark_bb There are several reasons to add a general function ext4_mb_mark_context to update block bitmap and group descriptor on disk: 1. pair behavior of alloc/free bits. For example, ext4_mb_new_blocks_simple will update free_clusters in struct flex_groups in ext4_mb_mark_bb while ext4_free_blocks_simple forgets this. 2. remove repeat code to read from disk, update and write back to disk. 3. reduce future unit test mocks to catch real IO to update structure on disk. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
d2f7cf40 |
|
28-Sep-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: make state in ext4_mb_mark_bb to be bool As state could only be either 0 or 1, just make it bool. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230928160407.142069-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
ebf6cb7c |
|
24-Aug-2023 |
Wang Jianjian <wangjianjian0@foxmail.com> |
ext4: no need to generate from free list in mballoc Commit 7a2fcbf7f85 ("ext4: don't use blocks freed but not yet committed in buddy cache init") added a code to mark as used blocks in the list of not yet committed freed blocks during initialization of a buddy page. However ext4_mb_free_metadata() makes sure buddy page is already loaded and takes a reference to it so it cannot happen that ext4_mb_init_cache() is called when efd list is non-empty. Just remove the ext4_mb_generate_from_freelist() call. Fixes: 7a2fcbf7f85('ext4: don't use blocks freed but not yet committed in buddy cache init') Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com> Link: https://lore.kernel.org/r/tencent_53CBCB1668358AE862684E453DF37B722008@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
ce774e53 |
|
12-Jun-2023 |
Jinke Han <hanjinke.666@bytedance.com> |
ext4: make running and commit transaction have their own freed_data_list When releasing space in jbd, we traverse s_freed_data_list to get the free range belonging to the current commit transaction. In extreme cases, the time spent may not be small, and we have observed cases exceeding 10ms. This patch makes running and commit transactions manage their own free_data_list respectively, eliminating unnecessary traversal. And in the callback phase of the commit transaction, no one will touch it except the jbd thread itself, so s_md_lock is no longer needed. Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20230612124017.14115-1-hanjinke.666@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
5229a658 |
|
13-Sep-2023 |
Jan Kara <jack@suse.cz> |
ext4: do not let fstrim block system suspend Len Brown has reported that system suspend sometimes fail due to inability to freeze a task working in ext4_trim_fs() for one minute. Trimming a large filesystem on a disk that slowly processes discard requests can indeed take a long time. Since discard is just an advisory call, it is perfectly fine to interrupt it at any time and the return number of discarded blocks until that moment. Do that when we detect the task is being frozen. Cc: stable@kernel.org Reported-by: Len Brown <lenb@kernel.org> Suggested-by: Dave Chinner <david@fromorbit.com> References: https://bugzilla.kernel.org/show_bug.cgi?id=216322 Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230913150504.9054-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
45e4ab32 |
|
13-Sep-2023 |
Jan Kara <jack@suse.cz> |
ext4: move setting of trimmed bit into ext4_try_to_trim_range() Currently we set the group's trimmed bit in ext4_trim_all_free() based on return value of ext4_try_to_trim_range(). However when we will want to abort trimming because of suspend attempt, we want to return success from ext4_try_to_trim_range() but not set the trimmed bit. Instead implementing awkward propagation of this information, just move setting of trimmed bit into ext4_try_to_trim_range() when the whole group is trimmed. Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230913150504.9054-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
0f6bc579 |
|
12-Aug-2023 |
Ruan Jinjie <ruanjinjie@huawei.com> |
ext4: use LIST_HEAD() to initialize the list_head in mballoc.c Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Link: https://lore.kernel.org/r/20230812071839.3481909-1-ruanjinjie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a50bda14 |
|
24-Jul-2023 |
Su Hui <suhui@nfschina.com> |
ext4: mballoc: avoid garbage value from err clang's static analysis warning: fs/ext4/mballoc.c line 4178, column 6, Branch condition evaluates to a garbage value. err is uninitialized and will be judged when 'len <= 0' or it first enters the loop while the condition "!ext4_sb_block_valid()" is true. Although this can't make problems now, it's better to correct it. Signed-off-by: Su Hui <suhui@nfschina.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20230725043310.1227621-1-suhui@nfschina.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
79ebf48c |
|
07-Jul-2023 |
Lu Hongfei <luhongfei@vivo.com> |
ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple() Signed-off-by: Lu Hongfei <luhongfei@vivo.com> Link: https://lore.kernel.org/r/20230707115907.26637-1-luhongfei@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
89cadf6e |
|
07-Jul-2023 |
Lu Hongfei <luhongfei@vivo.com> |
ext4: change the type of blocksize in ext4_mb_init_cache() The return value type of i_blocksize() is 'unsigned int', so the type of blocksize has been modified from 'int' to 'unsigned int' to ensure data type consistency. Signed-off-by: Lu Hongfei <luhongfei@vivo.com> Link: https://lore.kernel.org/r/20230707105516.9156-1-luhongfei@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
772c9f691 |
|
16-Jul-2023 |
Ritesh Harjani <ritesh.list@gmail.com> |
ext4: don't use CR_BEST_AVAIL_LEN for non-regular files Using CR_BEST_AVAIL_LEN only make sense for regular files, as for non-regular files we never normalize the allocation request length i.e. goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len). Hence there is no scope of trimming the goal length to make it satisfy original request len. Thus this patch avoids using CR_BEST_AVAIL_LEN criteria for non-regular files request. Cc: stable@kernel.org Fixes: 33122aa930f1 ("ext4: Add allocation criteria 1.5 (CR1_5)") Reported-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Tested-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/2a694c748ff8b8c4b416995a24f06f07b55047a8.1689516047.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4eea9fbe |
|
01-Aug-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: correct some stale comment of criteria We named criteria with CR_XXX, correct stale comment to criteria with raw number. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-11-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bcb123ac |
|
01-Aug-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: return found group directly in ext4_mb_choose_next_group_best_avail Return good group when it's found in loop to remove futher check if good group is found after loop. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b50675a4 |
|
01-Aug-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: return found group directly in ext4_mb_choose_next_group_goal_fast Return good group when it's found in loop to remove futher check if good group is found after loop. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
de8bf0e5 |
|
01-Aug-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: replace the traditional ternary conditional operator with with max()/min() Replace the traditional ternary conditional operator with with max()/min() Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
ad635507 |
|
01-Aug-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unnecessary return for void function The return at end of void function is unnecessary, just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bb60caa2 |
|
01-Aug-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: use is_power_of_2 helper in ext4_mb_regular_allocator Use intuitive is_power_of_2 helper in ext4_mb_regular_allocator. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
919eb90c |
|
01-Aug-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: return found group directly in ext4_mb_choose_next_group_p2_aligned Return good group when it's found in loop to remove unnecessary NULL initialization of grp and futher check if good group is found after loop. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
60c672b7 |
|
01-Aug-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: avoid potential data overflow in next_linear_group ngroups is ext4_group_t (unsigned int) while next_linear_group treat it in int. If ngroups is bigger than max number described by int, it will be treat as a negative number. Then "return group + 1 >= ngroups ? 0 : group + 1;" may keep returning 0. Switch int to ext4_group_t in next_linear_group to fix the overflow. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a9ce5993 |
|
01-Aug-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: correct grp validation in ext4_mb_good_group Group corruption check will access memory of grp and will trigger kernel crash if grp is NULL. So do NULL check before corruption check. Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
304749c0 |
|
30-Jun-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: replace CR_FAST macro with inline function for readability Replace CR_FAST with ext4_mb_cr_expensive() inline function for better readability. This function returns true if the criteria is one of the expensive/slower ones where lots of disk IO/prefetching is acceptable. No functional changes are intended in this patch. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230630085927.140137-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
95257987 |
|
16-Jun-2023 |
Jan Kara <jack@suse.cz> |
ext4: drop EXT4_MF_FS_ABORTED flag EXT4_MF_FS_ABORTED flag has practically the same intent as EXT4_FLAGS_SHUTDOWN flag. The shutdown flag is checked in many more places than the aborted flag which is mostly the historical artifact where we were relying on SB_RDONLY checks instead of the aborted flag checks. There are only three places - ext4_sync_file(), __ext4_remount(), and mballoc debug code - which check aborted flag and not shutdown flag and this is arguably a bug. Avoid these inconsistencies by removing EXT4_MF_FS_ABORTED flag and using EXT4_FLAGS_SHUTDOWN everywhere. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bedc5d34 |
|
24-Jul-2023 |
Baokun Li <libaokun1@huawei.com> |
ext4: avoid overlapping preallocations due to overflow Let's say we want to allocate 2 blocks starting from 4294966386, after predicting the file size, start is aligned to 4294965248, len is changed to 2048, then end = start + size = 0x100000000. Since end is of type ext4_lblk_t, i.e. uint, end is truncated to 0. This causes (pa->pa_lstart >= end) to always hold when checking if the current extent to be allocated crosses already preallocated blocks, so the resulting ac_g_ex may cross already preallocated blocks. Hence we convert the end type to loff_t and use pa_logical_end() to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bc056e71 |
|
24-Jul-2023 |
Baokun Li <libaokun1@huawei.com> |
ext4: fix BUG in ext4_mb_new_inode_pa() due to overflow When we calculate the end position of ext4_free_extent, this position may be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not the first case of adjusting the best extent, that is, new_bex_end > 0, the following BUG_ON will be triggered: ========================================================= kernel BUG at fs/ext4/mballoc.c:5116! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279 RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430 Call Trace: <TASK> ext4_mb_use_best_found+0x203/0x2f0 ext4_mb_try_best_found+0x163/0x240 ext4_mb_regular_allocator+0x158/0x1550 ext4_mb_new_blocks+0x86a/0xe10 ext4_ext_map_blocks+0xb0c/0x13a0 ext4_map_blocks+0x2cd/0x8f0 ext4_iomap_begin+0x27b/0x400 iomap_iter+0x222/0x3d0 __iomap_dio_rw+0x243/0xcb0 iomap_dio_rw+0x16/0x80 ========================================================= A simple reproducer demonstrating the problem: mkfs.ext4 -F /dev/sda -b 4096 100M mount /dev/sda /tmp/test fallocate -l1M /tmp/test/tmp fallocate -l10M /tmp/test/file fallocate -i -o 1M -l16777203M /tmp/test/file fsstress -d /tmp/test -l 0 -n 100000 -p 8 & sleep 10 && killall -9 fsstress rm -f /tmp/test/tmp xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192" We simply refactor the logic for adjusting the best extent by adding a temporary ext4_free_extent ex and use extent_logical_end() to avoid overflow, which also simplifies the code. Cc: stable@kernel.org # 6.4 Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
43bbddc0 |
|
24-Jul-2023 |
Baokun Li <libaokun1@huawei.com> |
ext4: add two helper functions extent_logical_end() and pa_logical_end() When we use lstart + len to calculate the end of free extent or prealloc space, it may exceed the maximum value of 4294967295(0xffffffff) supported by ext4_lblk_t and cause overflow, which may lead to various problems. Therefore, we add two helper functions, extent_logical_end() and pa_logical_end(), to limit the type of end to loff_t, and also convert lstart to loff_t for calculation to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
9d3de7ee |
|
22-Jul-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: fix rbtree traversal bug in ext4_mb_use_preallocated During allocations, while looking for preallocations(PA) in the per inode rbtree, we can't do a direct traversal of the tree because ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted and that can cause direct traversal to skip some entries. This was leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy our request and ultimately tried to create a new PA that would overlap with the missed one. To makes sure we handle that case while still keeping the performance of the rbtree, we make use of the fact that the only pa that could possibly overlap the original goal start is the one that satisfies the below conditions: 1. It must have it's logical start immediately to the left of (ie less than) original logical start. 2. It must not be deleted To find this pa we use the following traversal method: 1. Descend into the rbtree normally to find the immediate neighboring PA. Here we keep descending irrespective of if the PA is deleted or if it overlaps with our request etc. The goal is to find an immediately adjacent PA. 2. If the found PA is on right of original goal, use rb_prev() to find the left adjacent PA. 3. Check if this PA is deleted and keep moving left with rb_prev() until a non deleted PA is found. 4. This is the PA we are looking for. Now we can check if it can satisfy the original request and proceed accordingly. This approach also takes care of having deleted PAs in the tree. (While we are at it, also fix a possible overflow bug in calculating the end of a PA) [1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/ Cc: stable@kernel.org # 6.4 Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
5d5460fa |
|
09-Jun-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail() In ext4_mb_choose_next_group_best_avail(), we want the start order to be 1 less than goal length and the min_order to be, at max, 1 more than the original length. This commit fixes an off by one issue that arose due to the fact that 1 << fls(n) > (n). After all the processing: order = 1 order below goal len min_order = maximum of the three:- - order - trim_order - 1 order below B2C(s_stripe) - 1 order above original len Cc: stable@kernel.org Fixes: 33122aa930 ("ext4: Add allocation criteria 1.5 (CR1_5)") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230609103403.112807-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4c0cfebd |
|
08-Jun-2023 |
Theodore Ts'o <tytso@mit.edu> |
ext4: clean up mballoc criteria comments Line wrap and slightly clarify the comments describing mballoc's cirtiera. Define EXT4_MB_NUM_CRS as part of the enum, so that it will automatically get updated when criteria is added or removed. Also fix a potential unitialized use of 'cr' variable if CONFIG_EXT4_DEBUG is enabled. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
f52f3d2b |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Give symbolic names to mballoc criterias mballoc criterias have historically been called by numbers like CR0, CR1... however this makes it confusing to understand what each criteria is about. Change these criterias from numbers to symbolic names and add relevant comments. While we are at it, also reformat and add some comments to ext4_seq_mb_stats_show() for better readability. Additionally, define CR_FAST which signifies the criteria below which we can make quicker decisions like: * quitting early if (free block < requested len) * avoiding to scan free extents smaller than required len. * avoiding to initialize buddy cache and work with existing cache * limiting prefetches Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/a2dc6ec5aea5e5e68cf8e788c2a964ffead9c8b0.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
7e170922 |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Add allocation criteria 1.5 (CR1_5) CR1_5 aims to optimize allocations which can't be satisfied in CR1. The fact that we couldn't find a group in CR1 suggests that it would be difficult to find a continuous extent to compleltely satisfy our allocations. So before falling to the slower CR2, in CR1.5 we proactively trim the the preallocations so we can find a group with (free / fragments) big enough. This speeds up our allocation at the cost of slightly reduced preallocation. The patch also adds a new sysfs tunable: * /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order This controls how much CR1.5 can trim a request before falling to CR2. For example, for a request of order 7 and max trim order 2, CR1.5 can trim this upto order 5. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/150fdf65c8e4cc4dba71e020ce0859bcf636a5ff.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
856d865c |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Abstract out logic to search average fragment list Make the logic of searching average fragment list of a given order reusable by abstracting it out to a differnet function. This will also avoid code duplication in upcoming patches. No functional changes. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/028c11d95b17ce0285f45456709a0ca922df1b83.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4f3d1e45 |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs Before this patch, the call stack in ext4_run_li_request is as follows: /* * nr = no. of BGs we want to fetch (=s_mb_prefetch) * prefetch_ios = no. of BGs not uptodate after * ext4_read_block_bitmap_nowait() */ next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios); ext4_mb_prefetch_fini(sb, next_group prefetch_ios); ext4_mb_prefetch_fini() will only try to initialize buddies for BGs in range [next_group - prefetch_ios, next_group). This is incorrect since sometimes (prefetch_ios < nr), which causes ext4_mb_prefetch_fini() to incorrectly ignore some of the BGs that might need initialization. This issue is more notable now with the previous patch enabling "fetching" of BLOCK_UNINIT BGs which are marked buffer_uptodate by default. Fix this by passing nr to ext4_mb_prefetch_fini() instead of prefetch_ios so that it considers the right range of groups. Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in ext4_mb_regular_allocator() since we might have prefetched BLOCK_UNINIT groups that would need buddy initialization. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/05e648ae04ec5b754207032823e9c1de9a54f87a.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
3c629604 |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Don't skip prefetching BLOCK_UNINIT groups Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO. As a consequence, we end not initializing the buddy structures and CR0/1 lists for these BGs, even though it can be done without any disk IO overhead. Hence, don't skip such BGs during prefetch and prefetch_fini. This improves the accuracy of CR0/1 allocation as earlier, we could have essentially empty BLOCK_UNINIT groups being ignored by CR0/1 due to their buddy not being initialized, leading to slower CR2 allocations. With this patch CR0/1 will be able to discover these groups as well, thus improving performance. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/dc3130b8daf45ffe63d8a3c1edcf00eb8ba70e1f.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1b420011 |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Avoid scanning smaller extents in BG during CR1 When we are inside ext4_mb_complex_scan_group() in CR1, we can be sure that this group has atleast 1 big enough continuous free extent to satisfy our request because (free / fragments) > goal length. Hence, instead of wasting time looping over smaller free extents, only try to consider the free extent if we are sure that it has enough continuous free space to satisfy goal length. This is particularly useful when scanning highly fragmented BGs in CR1 as, without this patch, the allocator might stop scanning early before reaching the big enough free extent (due to ac_found > mb_max_to_scan) which causes us to uncessarily trim the request. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/a5473df4517c53ec940bc9b603ef83a547032a32.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
3ef5d263 |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Add counter to track successful allocation of goal length Track number of allocations where the length of blocks allocated is equal to the length of goal blocks (post normalization). This metric could be useful if making changes to the allocator logic in the future as it could give us visibility into how often do we trim our requests. PS: ac_b_ex.fe_len might get modified due to preallocation efforts and hence we use ac_f_ex.fe_len instead since we want to compare how much the allocator was able to actually find. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/343620e2be8a237239ea2613a7a866ee8607e973.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
fdd9a009 |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Add per CR extent scanned counter This gives better visibility into the number of extents scanned in each particular CR. For example, this information can be used to see how out block group scanning logic is performing when the BG is fragmented. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/55bb6d80f6e22ed2a5a830aa045572bdffc8b1b9.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4eb7a4a1 |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Convert mballoc cr (criteria) to enum Convert criteria to be an enum so it easier to maintain and update the tracefiles to use enum names. This change also makes it easier to insert new criterias in the future. There is no functional change in this patch. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/5d82fd467bdf70ea45bdaef810af3b146013946c.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
569f196f |
|
30-May-2023 |
Ritesh Harjani <ritesh.list@gmail.com> |
ext4: mballoc: Remove useless setting of ac_criteria There will be changes coming in future patches which will introduce a new criteria for block allocation. This removes the useless setting of ac_criteria. AFAIU, this might be only used to differentiate between whether a preallocated blocks was allocated or was regular allocator called for allocating blocks. Hence this also adds the debug prints to identify what type of block allocation was done in ext4_mb_show_ac(). Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1dbae05617519cb6202f1b299c9d1be3e7cda763.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
2ec6d0a5 |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: fix wrong unit use in ext4_mb_new_blocks Function ext4_free_blocks_simple needs count in cluster. Function ext4_free_blocks accepts count in block. Convert count to cluster to fix the mismatch. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-12-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
247c3d21 |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: fix wrong unit use in ext4_mb_clear_bb Function ext4_issue_discard need count in cluster. Pass count_clusters instead of count to fix the mismatch. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-11-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
ad78b5ef |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unused parameter from ext4_mb_new_blocks_simple() Two cleanups for ext4_mb_new_blocks_simple: Remove unused parameter handle of ext4_mb_new_blocks_simple. Move ext4_mb_new_blocks_simple definition before ext4_mb_new_blocks to remove unnecessary forward declaration of ext4_mb_new_blocks_simple. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
11b6890b |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: get block from bh in ext4_free_blocks for fast commit replay ext4_free_blocks will retrieve block from bh if block parameter is zero. Retrieve block before ext4_free_blocks_simple to avoid potentially passing wrong block to ext4_free_blocks_simple. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
19a043bb |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: try all groups in ext4_mb_new_blocks_simple ext4_mb_new_blocks_simple ignores the group before goal, so it will fail if free blocks reside in group before goal. Try all groups to avoid unexpected failure. Search finishes either if any free block is found or if no available blocks are found. Simpliy check "i >= max" to distinguish the above cases. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Suggested-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1eff5904 |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: add EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated ext4_mb_use_preallocated will ignore the demand to alloc goal blocks, although the EXT4_MB_HINT_GOAL_ONLY is requested. For group pa, ext4_mb_group_or_file will not set EXT4_MB_HINT_GROUP_ALLOC if EXT4_MB_HINT_GOAL_ONLY is set. So we will not alloc goal blocks from group pa if EXT4_MB_HINT_GOAL_ONLY is set. For inode pa, ext4_mb_pa_goal_check is added to check if free extent in found inode pa meets goal blocks when EXT4_MB_HINT_GOAL_ONLY is set. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Suggested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
c3defd99 |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: treat stripe in block unit Stripe is misused in block unit and in cluster unit in different code paths. User awared of stripe maybe not awared of bigalloc feature, so treat stripe only in block unit to fix this. Besides, it's hard to get stripe aligned blocks (start and length are both aligned with stripe) if stripe is not aligned with cluster, just disable stripe and alert user in this case to simpfy the code and avoid unnecessary work to get stripe aligned blocks which likely to be failed. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
99c515e3 |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: fix wrong unit use in ext4_mb_find_by_goal We need start in block unit while fe_start is in cluster unit. Use ext4_grp_offs_to_block helper to convert fe_start to get start in block unit. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
497885f7 |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: fix unit mismatch in ext4_mb_new_blocks_simple The "i" returned from mb_find_next_zero_bit is in cluster unit and we need offset "block" corresponding to "i" in block unit. Convert "i" to block unit to fix the unit mismatch. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b3916da0 |
|
03-Jun-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: fix wrong unit use in ext4_mb_normalize_request NRL_CHECK_SIZE will compare input req and size, so req and size should be in same unit. Input req "fe_len" is in cluster unit while input size "(8<<20)>>bsbits" is in block unit. Convert "fe_len" to block unit to fix the mismatch. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
3582e745 |
|
30-May-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits" This reverts commit 32c0869370194ae5ac9f9f501953ef693040f6a1. The reverted commit was intended to remove a dead check however it was observed that this check was actually being used to exit early instead of looping sbi->s_mb_max_to_scan times when we are able to find a free extent bigger than the goal extent. Due to this, a my performance tests (fsmark, parallel file writes in a highly fragmented FS) were seeing a 2x-3x regression. Example, the default value of the following variables is: sbi->s_mb_max_to_scan = 200 sbi->s_mb_min_to_scan = 10 In ext4_mb_check_limits() if we find an extent smaller than goal, then we return early and try again. This loop will go on until we have processed sbi->s_mb_max_to_scan(=200) number of free extents at which point we exit and just use whatever we have even if it is smaller than goal extent. Now, the regression comes when we find an extent bigger than goal. Earlier, in this case we would loop only sbi->s_mb_min_to_scan(=10) times and then just use the bigger extent. However with commit 32c08693 that check was removed and hence we would loop sbi->s_mb_max_to_scan(=200) times even though we have a big enough free extent to satisfy the request. The only time we would exit early would be when the free extent is *exactly* the size of our goal, which is pretty uncommon occurrence and so we would almost always end up looping 200 times. Hence, revert the commit by adding the check back to fix the regression. Also add a comment to outline this policy. Fixes: 32c086937019 ("ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/ddcae9658e46880dfec2fb0aa61d01fb3353d202.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
463808f2 |
|
29-Apr-2023 |
Theodore Ts'o <tytso@mit.edu> |
ext4: remove a BUG_ON in ext4_mb_release_group_pa() If a malicious fuzzer overwrites the ext4 superblock while it is mounted such that the s_first_data_block is set to a very large number, the calculation of the block group can underflow, and trigger a BUG_ON check. Change this to be an ext4_warning so that we don't crash the kernel. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-3-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
5354b2af |
|
28-Apr-2023 |
Theodore Ts'o <tytso@mit.edu> |
ext4: allow ext4_get_group_info() to fail Previously, ext4_get_group_info() would treat an invalid group number as BUG(), since in theory it should never happen. However, if a malicious attaker (or fuzzer) modifies the superblock via the block device while it is the file system is mounted, it is possible for s_first_data_block to get set to a very large number. In that case, when calculating the block group of some block number (such as the starting block of a preallocation region), could result in an underflow and very large block group number. Then the BUG_ON check in ext4_get_group_info() would fire, resutling in a denial of service attack that can be triggered by root or someone with write access to the block device. For a quality of implementation perspective, it's best that even if the system administrator does something that they shouldn't, that it will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info() will call ext4_error and return NULL. We also add fallback code in all of the callers of ext4_get_group_info() that it might NULL. Also, since ext4_get_group_info() was already borderline to be an inline function, un-inline it. The results in a next reduction of the compiled text size of ext4 by roughly 2k. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
361eb69f |
|
25-Mar-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Remove the logic to trim inode PAs Earlier, inode PAs were stored in a linked list. This caused a need to periodically trim the list down inorder to avoid growing it to a very large size, as this would severly affect performance during list iteration. Recent patches changed this list to an rbtree, and since the tree scales up much better, we no longer need to have the trim functionality, hence remove it. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/c409addceaa3ade4b40328e28e3b54b2f259689e.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
38727786 |
|
25-Mar-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list Currently, the kernel uses i_prealloc_list to hold all the inode preallocations. This is known to cause degradation in performance in workloads which perform large number of sparse writes on a single file. This is mainly because functions like ext4_mb_normalize_request() and ext4_mb_use_preallocated() iterate over this complete list, resulting in slowdowns when large number of PAs are present. Patch 27bc446e2 partially fixed this by enforcing a limit of 512 for the inode preallocation list and adding logic to continually trim the list if it grows above the threshold, however our testing revealed that a hardcoded value is not suitable for all kinds of workloads. To optimize this, add an rbtree to the inode and hold the inode preallocations in this rbtree. This will make iterating over inode PAs faster and scale much better than a linked list. Additionally, we also had to remove the LRU logic that was added during trimming of the list (in ext4_mb_release_context()) as it will add extra overhead in rbtree. The discards now happen in the lowest-logical-offset-first order. ** Locking notes ** With the introduction of rbtree to maintain inode PAs, we can't use RCU to walk the tree for searching since it can result in partial traversals which might miss some nodes(or entire subtrees) while discards happen in parallel (which happens under a lock). Hence this patch converts the ei->i_prealloc_lock spin_lock to rw_lock. Almost all the codepaths that read/modify the PA rbtrees are protected by the higher level inode->i_data_sem (except ext4_mb_discard_group_preallocations() and ext4_clear_inode()) IIUC, the only place we need lock protection is when one thread is reading "searching" the PA rbtree (earlier protected under rcu_read_lock()) and another is "deleting" the PAs in ext4_mb_discard_group_preallocations() function (which iterates all the PAs using the grp->bb_prealloc_list and deletes PAs from the tree without taking any inode lock (i_data_sem)). So, this patch converts all rcu_read_lock/unlock() paths for inode list PA to use read_lock() and all places where we were using ei->i_prealloc_lock spinlock will now be using write_lock(). Note that this makes the fast path (searching of the right PA e.g. ext4_mb_use_preallocated() or ext4_mb_normalize_request()), now use read_lock() instead of rcu_read_lock/unlock(). Ths also will now block due to slow discard path (ext4_mb_discard_group_preallocations()) which uses write_lock(). But this is not as bad as it looks. This is because - 1. The slow path only occurs when the normal allocation failed and we can say that we are low on disk space. One can argue this scenario won't be much frequent. 2. ext4_mb_discard_group_preallocations(), locks and unlocks the rwlock for deleting every individual PA. This gives enough opportunity for the fast path to acquire the read_lock for searching the PA inode list. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/4137bce8f6948fedd8bae134dabae24acfe699c6.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a8e38fd3 |
|
25-Mar-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union ** Splitting pa->pa_inode_list ** Currently, we use the same pa->pa_inode_list to add a pa to either the inode preallocation list or the locality group preallocation list. For better clarity, split this list into a union of 2 list_heads and use either of the them based on the type of pa. ** Splitting pa->pa_obj_lock ** Currently, pa->pa_obj_lock is either assigned &ei->i_prealloc_lock for inode PAs or lg_prealloc_lock for lg PAs, and is then used to lock the lists containing these PAs. Make the distinction between the 2 PA types clear by changing this lock to a union of 2 locks. Explicitly use the pa_lock_node.inode_lock for inode PAs and pa_lock_node.lg_lock for lg PAs. This patch is required so that the locality group preallocation code remains the same as in upcoming patches we are going to make changes to inode preallocation code to move from list to rbtree based implementation. This patch also makes it easier to review the upcoming patches. There are no functional changes in this patch. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1d7ac0557e998c3fc7eef422b52e4bc67bdef2b0.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
93cdf49f |
|
25-Mar-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa() When the length of best extent found is less than the length of goal extent we need to make sure that the best extent atleast covers the start of the original request. This is done by adjusting the ac_b_ex.fe_logical (logical start) of the extent. While doing so, the current logic sometimes results in the best extent's logical range overflowing the goal extent. Since this best extent is later added to the inode preallocation list, we have a possibility of introducing overlapping preallocations. This is discussed in detail here [1]. As per Jan's suggestion, to fix this, replace the existing logic with the below logic for adjusting best extent as it keeps fragmentation in check while ensuring logical range of best extent doesn't overflow out of goal extent: 1. Check if best extent can be kept at end of goal range and still cover original start. 2. Else, check if best extent can be kept at start of goal range and still cover original start. 3. Else, keep the best extent at start of original request. Also, add a few extra BUG_ONs that might help catch errors faster. [1] https://lore.kernel.org/r/Y+OGkVvzPN0RMv0O@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/f96aca6d415b36d1f90db86c1a8cd7e2e9d7ab0e.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
0830344c |
|
25-Mar-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Abstract out overlap fix/check logic in ext4_mb_normalize_request() Abstract out the logic of fixing PA overlaps in ext4_mb_normalize_request to improve readability of code. This also makes it easier to make changes to the overlap logic in future. There are no functional changes in this patch Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/9b35f3955a1d7b66bbd713eca1e63026e01f78c1.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
7692094a |
|
25-Mar-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Move overlap assert logic into a separate function Abstract out the logic to double check for overlaps in normalize_pa to a separate function. Since there has been no reports in past where we have seen any overlaps which hits this bug_on(), in future we can consider calling this function under "#ifdef AGGRESSIVE_CHECK" only. There are no functional changes in this patch Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/35dd5d94fa0b2d1cd2d2947adf8967279c72967d.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bcf43499 |
|
25-Mar-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Refactor code in ext4_mb_normalize_request() and ext4_mb_use_preallocated() Change some variable names to be more consistent and refactor some of the code to make it easier to read. There are no functional changes in this patch Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/8edcab489c06cf861b19d87207d9b0ff7ac7f3c1.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
82089725 |
|
25-Mar-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Refactor code related to freeing PAs This patch makes the following changes: * Rename ext4_mb_pa_free to ext4_mb_pa_put_free to better reflect its purpose * Add new ext4_mb_pa_free() which only handles freeing * Refactor ext4_mb_pa_callback() to use ext4_mb_pa_free() There are no functional changes in this patch Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/b273bc9cbf5bd278f641fa5bc6c0cc9e6cb3330c.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
e86a7182 |
|
25-Mar-2023 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: Stop searching if PA doesn't satisfy non-extent file If we come across a PA that matches the logical offset but is unable to satisfy a non-extent file due to its physical start being higher than that supported by non extent files, then simply stop searching for another PA and break out of loop. This is because, since PAs don't overlap, we won't be able to find another inode PA which can satisfy the original request. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/42404ca29bd304ae2c962184c3c32a02e8eefcd0.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
19b8b035 |
|
16-Mar-2023 |
Theodore Ts'o <tytso@mit.edu> |
ext4: convert some BUG_ON's in mballoc to use WARN_RATELIMITED instead In cases where we have an obvious way of continuing, let's use WARN_RATELIMITED() instead of BUG_ON(). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
91a48aaf |
|
11-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: avoid unnecessary pointer dereference in ext4_mb_normalize_request Result of EXT4_SB(ac->ac_sb) is already stored in sbi at beginning of ext4_mb_normalize_request. Use sbi instead of EXT4_SB(ac->ac_sb) to remove unnecessary pointer dereference. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230311170949.1047958-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1221b235 |
|
11-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: fix typos in mballoc pa_plen -> pa_len pa_start -> pa_pstart Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230311170949.1047958-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
253cacb0 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: simplify calculation of blkoff in ext4_mb_new_blocks_simple We try to allocate a block from goal in ext4_mb_new_blocks_simple. We only need get blkoff in first group with goal and set blkoff to 0 for the rest groups. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230303172120.3800725-21-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
46825e94 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove comment code ext4_discard_preallocations Just remove comment code in ext4_discard_preallocations. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-20-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
3a037b1b |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove repeat assignment to ac_f_ex Call trace to assign ac_f_ex: ext4_mb_use_best_found ac->ac_f_ex = ac->ac_b_ex; ext4_mb_new_preallocation ext4_mb_new_group_pa ac->ac_f_ex = ac->ac_b_ex; ext4_mb_new_inode_pa ac->ac_f_ex = ac->ac_b_ex; Actually allocated blocks is already stored in ac_f_ex in ext4_mb_use_best_found, so there is no need to assign ac_f_ex in ext4_mb_new_group_pa and ext4_mb_new_inode_pa. Just remove repeat assignment to ac_f_ex in ext4_mb_new_group_pa and ext4_mb_new_inode_pa. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-19-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
fb28f9ce |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unnecessary goto in ext4_mb_mark_diskspace_used When ext4_read_block_bitmap fails, we can return PTR_ERR(bitmap_bh) to remove unnecessary NULL check of bitmap_bh. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230303172120.3800725-18-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
c7f2bafa |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unnecessary count2 in ext4_free_data_in_buddy count2 is always 1 in mb_debug, just remove unnecessary count2. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-17-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
df119095 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unnecessary exit_meta_group_info tag We goto exit_meta_group_info only to return -ENOMEM. Return -ENOMEM directly instead of goto to remove this unnecessary tag. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-16-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
78dc9f84 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: use best found when complex scan of group finishs If any bex which meets bex->fe_len >= gex->fe_len is found, then it will always be used when complex scan of group that bex belongs to finishs. So there will not be any lock-unlock period. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-15-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
32c08693 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits Only call trace of ext4_mb_check_limits is as following: ext4_mb_complex_scan_group ext4_mb_measure_extent ext4_mb_check_limits(ac, e4b, 0); ext4_mb_check_limits(ac, e4b, 1); If the first ac->ac_found > sbi->s_mb_max_to_scan check in ext4_mb_check_limits is met, we will set ac_status to AC_STATUS_BREAK and call ext4_mb_try_best_found to try to use ac->ac_b_ex. If ext4_mb_try_best_found successes, then block allocation finishs, the removed ac->ac_found > sbi->s_mb_min_to_scan check is not reachable. If ext4_mb_try_best_found fails, then we set EXT4_MB_HINT_FIRST and reset ac->ac_b_ex to retry block allocation. We will use any found free extent in ext4_mb_measure_extent before reach the removed ac->ac_found > sbi->s_mb_min_to_scan check. In summary, the removed ac->ac_found > sbi->s_mb_min_to_scan check is not reachable and we can remove that dead check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-14-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
976620bd |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove dead check in mb_buddy_mark_free We always adjust first to even number and adjust last to odd number, so first == last will never happen. Remove this dead check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230303172120.3800725-13-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
aaae558d |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unnecessary check in ext4_mb_new_blocks 1. remove unnecessary ac check: We always go to out tag before ac is successfully allocated, then we can move out tag after free of ac and remove NULL check of ac. 2. remove unnecessary *errp check: We always go to errout tag if *errp is non-zero, then we can move errout tag into error handle if *errp is non-zero. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-12-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
285164b8 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unnecessary e4b->bd_buddy_page check in ext4_mb_load_buddy_gfp e4b->bd_buddy_page is only set if we initialize ext4_buddy successfully. So e4b->bd_buddy_page is always NULL in error handle branch. Just remove the dead check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-11-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
139f46d3 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: Remove unnecessary release when memory allocation failed in ext4_mb_init_cache If we alloc array of buffer_head failed, there is no resource need to be freed and we can simpily return error. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
85b67ffb |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unused return value of ext4_mb_try_best_found and ext4_mb_free_metadata Return value static function ext4_mb_try_best_found and ext4_mb_free_metadata is not used. Just remove unused return value. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1b5c9d34 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: add missed brelse in ext4_free_blocks_simple Release bitmap buffer_head we got if error occurs. Besides, this patch remove unused assignment to err. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
36cb0f52 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: protect pa->pa_free in ext4_discard_allocated_blocks If ext4_mb_mark_diskspace_used fails in ext4_mb_new_blocks, we may discard pa already in list. Protect pa with pa_lock to avoid race. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1afdc588 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: correct start of used group pa for debug in ext4_mb_use_group_pa As we don't correct pa_lstart here, so there is no need to subtract pa_lstart with consumed len. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
abc075d4 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: correct calculation of s_mb_preallocated We will add pa_free to s_mb_preallocated when new ext4_prealloc_space is created. In ext4_mb_new_inode_pa, we will call ext4_mb_use_inode_pa before adding pa_free to s_mb_preallocated. However, ext4_mb_use_inode_pa will consume pa_free for block allocation which triggerred the creation of ext4_prealloc_space. Add pa_free to s_mb_preallocated before ext4_mb_use_inode_pa to correct calculation of s_mb_preallocated. There is no such problem in ext4_mb_new_group_pa as pa_free of group pa is consumed in ext4_mb_release_context instead of ext4_mb_use_group_pa. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
22fab984 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: get correct ext4_group_info in ext4_mb_prefetch_fini We always get ext4_group_desc with group + 1 and ext4_group_info with group to check if we need do initialize ext4_group_info for the group. Just get ext4_group_desc with group for ext4_group_info initialization check. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230303172120.3800725-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
01e4ca29 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: allow to find by goal if EXT4_MB_HINT_GOAL_ONLY is set If EXT4_MB_HINT_GOAL_ONLY is set, ext4_mb_regular_allocator will only allocate blocks from ext4_mb_find_by_goal. Allow to find by goal in ext4_mb_find_by_goal if EXT4_MB_HINT_GOAL_ONLY is set or allocation with EXT4_MB_HINT_GOAL_ONLY set will always fail. EXT4_MB_HINT_GOAL_ONLY is not used at all, so the problem is not found for now. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b07ffe69 |
|
03-Mar-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: set goal start correctly in ext4_mb_normalize_request We need to set ac_g_ex to notify the goal start used in ext4_mb_find_by_goal. Set ac_g_ex instead of ac_f_ex in ext4_mb_normalize_request. Besides we should assure goal start is in range [first_data_block, blocks_count) as ext4_mb_initialize_context does. [ Added a check to make sure size is less than ar->pright; otherwise we could end up passing an underflowed value of ar->pright - size to ext4_get_group_no_and_offset(), which will trigger a BUG_ON later on. - TYT ] Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230303172120.3800725-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1df9bde4 |
|
21-Feb-2023 |
Kemeng Shi <shikemeng@huaweicloud.com> |
ext4: remove unused group parameter in ext4_block_bitmap_csum_set Remove unused group parameter in ext4_block_bitmap_csum_set. After this, group parameter in ext4_set_bitmap_checksums is also not used, just remove it too. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230221203027.2359920-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
d73eff68 |
|
02-Dec-2022 |
Guoqing Jiang <guoqing.jiang@linux.dev> |
ext4: make ext4_mb_initialize_context return void Change the return type to void since it always return 0, and no need to do the checking in ext4_mb_new_blocks. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20221202120409.24098-1-guoqing.jiang@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a078dff8 |
|
22-Sep-2022 |
Jan Kara <jack@suse.cz> |
ext4: fixup possible uninitialized variable access in ext4_mb_choose_next_group_cr1() Variable 'grp' may be left uninitialized if there's no group with suitable average fragment size (or larger). Fix the problem by initializing it earlier. Link: https://lore.kernel.org/r/20220922091542.pkhedytey7wzp5fi@quack3 Fixes: 83e80a6e3543 ("ext4: use buckets for cr 1 block scan instead of rbtree") Cc: stable@kernel.org Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
80fa46d6 |
|
01-Sep-2022 |
Theodore Ts'o <tytso@mit.edu> |
ext4: limit the number of retries after discarding preallocations blocks This patch avoids threads live-locking for hours when a large number threads are competing over the last few free extents as they blocks getting added and removed from preallocation pools. From our bug reporter: A reliable way for triggering this has multiple writers continuously write() to files when the filesystem is full, while small amounts of space are freed (e.g. by truncating a large file -1MiB at a time). In the local filesystem, this can be done by simply not checking the return code of write (0) and/or the error (ENOSPACE) that is set. Over NFS with an async mount, even clients with proper error checking will behave this way since the linux NFS client implementation will not propagate the server errors [the write syscalls immediately return success] until the file handle is closed. This leads to a situation where NFS clients send a continuous stream of WRITE rpcs which result in ERRNOSPACE -- but since the client isn't seeing this, the stream of writes continues at maximum network speed. When some space does appear, multiple writers will all attempt to claim it for their current write. For NFS, we may see dozens to hundreds of threads that do this. The real-world scenario of this is database backup tooling (in particular, github.com/mdkent/percona-xtrabackup) which may write large files (>1TiB) to NFS for safe keeping. Some temporary files are written, rewound, and read back -- all before closing the file handle (the temp file is actually unlinked, to trigger automatic deletion on close/crash.) An application like this operating on an async NFS mount will not see an error code until TiB have been written/read. The lockup was observed when running this database backup on large filesystems (64 TiB in this case) with a high number of block groups and no free space. Fragmentation is generally not a factor in this filesystem (~thousands of large files, mostly contiguous except for the parts written while the filesystem is at capacity.) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
83e80a6e |
|
08-Sep-2022 |
Jan Kara <jack@suse.cz> |
ext4: use buckets for cr 1 block scan instead of rbtree Using rbtree for sorting groups by average fragment size is relatively expensive (needs rbtree update on every block freeing or allocation) and leads to wide spreading of allocations because selection of block group is very sentitive both to changes in free space and amount of blocks allocated. Furthermore selecting group with the best matching average fragment size is not necessary anyway, even more so because the variability of fragment sizes within a group is likely large so average is not telling much. We just need a group with large enough average fragment size so that we have high probability of finding large enough free extent and we don't want average fragment size to be too big so that we are likely to find free extent only somewhat larger than what we need. So instead of maintaing rbtree of groups sorted by fragment size keep bins (lists) or groups where average fragment size is in the interval [2^i, 2^(i+1)). This structure requires less updates on block allocation / freeing, generally avoids chaotic spreading of allocations into block groups, and still is able to quickly (even faster that the rbtree) provide a block group which is likely to have a suitably sized free space extent. This patch reduces number of block groups used when untarring archive with medium sized files (size somewhat above 64k which is default mballoc limit for avoiding locality group preallocation) to about half and thus improves write speeds for eMMC flash significantly. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a9f2a293 |
|
08-Sep-2022 |
Jan Kara <jack@suse.cz> |
ext4: use locality group preallocation for small closed files Curently we don't use any preallocation when a file is already closed when allocating blocks (from writeback code when converting delayed allocation). However for small files, using locality group preallocation is actually desirable as that is not specific to a particular file. Rather it is a method to pack small files together to reduce fragmentation and for that the fact the file is closed is actually even stronger hint the file would benefit from packing. So change the logic to allow locality group preallocation in this case. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1940265e |
|
08-Sep-2022 |
Jan Kara <jack@suse.cz> |
ext4: avoid unnecessary spreading of allocations among groups mb_set_largest_free_order() updates lists containing groups with largest chunk of free space of given order. The way it updates it leads to always moving the group to the tail of the list. Thus allocations looking for free space of given order effectively end up cycling through all groups (and due to initialization in last to first order). This spreads allocations among block groups which reduces performance for rotating disks or low-end flash media. Change mb_set_largest_free_order() to only update lists if the order of the largest free chunk in the group changed. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4fca50d4 |
|
08-Sep-2022 |
Jan Kara <jack@suse.cz> |
ext4: make mballoc try target group first even with mb_optimize_scan One of the side-effects of mb_optimize_scan was that the optimized functions to select next group to try were called even before we tried the goal group. As a result we no longer allocate files close to corresponding inodes as well as we don't try to expand currently allocated extent in the same group. This results in reaim regression with workfile.disk workload of upto 8% with many clients on my test machine: baseline mb_optimize_scan Hmean disk-1 2114.16 ( 0.00%) 2099.37 ( -0.70%) Hmean disk-41 87794.43 ( 0.00%) 83787.47 * -4.56%* Hmean disk-81 148170.73 ( 0.00%) 135527.05 * -8.53%* Hmean disk-121 177506.11 ( 0.00%) 166284.93 * -6.32%* Hmean disk-161 220951.51 ( 0.00%) 207563.39 * -6.06%* Hmean disk-201 208722.74 ( 0.00%) 203235.59 ( -2.63%) Hmean disk-241 222051.60 ( 0.00%) 217705.51 ( -1.96%) Hmean disk-281 252244.17 ( 0.00%) 241132.72 * -4.41%* Hmean disk-321 255844.84 ( 0.00%) 245412.84 * -4.08%* Also this is causing huge regression (time increased by a factor of 5 or so) when untarring archive with lots of small files on some eMMC storage cards. Fix the problem by making sure we try goal group first. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/ Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1e1c2b86 |
|
14-Jul-2022 |
Lukas Czerner <lczerner@redhat.com> |
ext4: block range must be validated before use in ext4_mb_clear_bb() Block range to free is validated in ext4_free_blocks() using ext4_inode_block_valid() and then it's passed to ext4_mb_clear_bb(). However in some situations on bigalloc file system the range might be adjusted after the validation in ext4_free_blocks() which can lead to troubles on corrupted file systems such as one found by syzkaller that resulted in the following BUG kernel BUG at fs/ext4/ext4.h:3319! PREEMPT SMP NOPTI CPU: 28 PID: 4243 Comm: repro Kdump: loaded Not tainted 5.19.0-rc6+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014 RIP: 0010:ext4_free_blocks+0x95e/0xa90 Call Trace: <TASK> ? lock_timer_base+0x61/0x80 ? __es_remove_extent+0x5a/0x760 ? __mod_timer+0x256/0x380 ? ext4_ind_truncate_ensure_credits+0x90/0x220 ext4_clear_blocks+0x107/0x1b0 ext4_free_data+0x15b/0x170 ext4_ind_truncate+0x214/0x2c0 ? _raw_spin_unlock+0x15/0x30 ? ext4_discard_preallocations+0x15a/0x410 ? ext4_journal_check_start+0xe/0x90 ? __ext4_journal_start_sb+0x2f/0x110 ext4_truncate+0x1b5/0x460 ? __ext4_journal_start_sb+0x2f/0x110 ext4_evict_inode+0x2b4/0x6f0 evict+0xd0/0x1d0 ext4_enable_quotas+0x11f/0x1f0 ext4_orphan_cleanup+0x3de/0x430 ? proc_create_seq_private+0x43/0x50 ext4_fill_super+0x295f/0x3ae0 ? snprintf+0x39/0x40 ? sget_fc+0x19c/0x330 ? ext4_reconfigure+0x850/0x850 get_tree_bdev+0x16d/0x260 vfs_get_tree+0x25/0xb0 path_mount+0x431/0xa70 __x64_sys_mount+0xe2/0x120 do_syscall_64+0x5b/0x80 ? do_user_addr_fault+0x1e2/0x670 ? exc_page_fault+0x70/0x170 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fdf4e512ace Fix it by making sure that the block range is properly validated before used every time it changes in ext4_free_blocks() or ext4_mb_clear_bb(). Link: https://syzkaller.appspot.com/bug?id=5266d464285a03cee9dbfda7d2452a72c3c2ae7c Reported-by: syzbot+15cd994e273307bf5cfa@syzkaller.appspotmail.com Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org> Link: https://lore.kernel.org/r/20220714165903.58260-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
218a6944 |
|
06-Jun-2022 |
hanjinke <hanjinke.666@bytedance.com> |
ext4: reuse order and buddy in mb_mark_used when buddy split After each buddy split, mb_mark_used will search the proper order for the block which may consume some loop in mb_find_order_for_block. In fact, we can reuse the order and buddy generated by the buddy split. Reviewed by: lei.rao@intel.com Signed-off-by: hanjinke <hanjinke.666@bytedance.com> Link: https://lore.kernel.org/r/20220606155305.74146-1-hanjinke.666@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
cf4ff938 |
|
28-May-2022 |
Baokun Li <libaokun1@huawei.com> |
ext4: correct the judgment of BUG in ext4_mb_normalize_request ext4_mb_normalize_request() can move logical start of allocated blocks to reduce fragmentation and better utilize preallocation. However logical block requested as a start of allocation (ac->ac_o_ex.fe_logical) should always be covered by allocated blocks so we should check that by modifying and to or in the assertion. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220528110017.354175-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a08f789d |
|
28-May-2022 |
Baokun Li <libaokun1@huawei.com> |
ext4: fix bug_on ext4_mb_use_inode_pa Hulk Robot reported a BUG_ON: ================================================================== kernel BUG at fs/ext4/mballoc.c:3211! [...] RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f [...] Call Trace: ext4_mb_new_blocks+0x9df/0x5d30 ext4_ext_map_blocks+0x1803/0x4d80 ext4_map_blocks+0x3a4/0x1a10 ext4_writepages+0x126d/0x2c30 do_writepages+0x7f/0x1b0 __filemap_fdatawrite_range+0x285/0x3b0 file_write_and_wait_range+0xb1/0x140 ext4_sync_file+0x1aa/0xca0 vfs_fsync_range+0xfb/0x260 do_fsync+0x48/0xa0 [...] ================================================================== Above issue may happen as follows: ------------------------------------- do_fsync vfs_fsync_range ext4_sync_file file_write_and_wait_range __filemap_fdatawrite_range do_writepages ext4_writepages mpage_map_and_submit_extent mpage_map_one_extent ext4_map_blocks ext4_mb_new_blocks ext4_mb_normalize_request >>> start + size <= ac->ac_o_ex.fe_logical ext4_mb_regular_allocator ext4_mb_simple_scan_group ext4_mb_use_best_found ext4_mb_new_preallocation ext4_mb_new_inode_pa ext4_mb_use_inode_pa >>> set ac->ac_b_ex.fe_len <= 0 ext4_mb_mark_diskspace_used >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); we can easily reproduce this problem with the following commands: `fallocate -l100M disk` `mkfs.ext4 -b 1024 -g 256 disk` `mount disk /mnt` `fsstress -d /mnt -l 0 -n 1000 -p 1` The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur when the size is truncated. So start should be the start position of the group where ac_o_ex.fe_logical is located after alignment. In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP is very large, the value calculated by start_off is more accurate. Cc: stable@kernel.org Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220528110017.354175-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
d63c00ea |
|
17-Apr-2022 |
Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> |
ext4: mark group as trimmed only if it was fully scanned Otherwise nonaligned fstrim calls will works inconveniently for iterative scanners, for example: // trim [0,16MB] for group-1, but mark full group as trimmed fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m // handle [16MB,16MB] for group-1, do nothing because group already has the flag. fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m [ Update function documentation for ext4_trim_all_free -- TYT ] Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Link: https://lore.kernel.org/r/1650214995-860245-1-git-send-email-dmtrmonakhov@yandex-team.ru Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
af2b3275 |
|
04-Apr-2022 |
Jinke Han <hanjinke.666@bytedance.com> |
ext4: remove unnecessary code in __mb_check_buddy When enter elseif branch, the the MB_CHECK_ASSERT will never fail. In addtion, the only illegal combination is 0/0, which can be caught by the first if branch. Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220404152243.13556-1-hanjinke.666@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
c30365b9 |
|
01-Apr-2022 |
Yu Zhe <yuzhe@nfschina.com> |
ext4: remove unnecessary type castings remove unnecessary void* type castings. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Link: https://lore.kernel.org/r/20220401081321.73735-1-yuzhe@nfschina.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
44abff2c |
|
14-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD Secure erase is a very different operation from discard in that it is a data integrity operation vs hint. Fully split the limits and helper infrastructure to make the separation more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2] Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
7b47ef52 |
|
14-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
block: add a bdev_discard_granularity helper Abstract away implementation details from file systems by providing a block_device based helper to retrieve the discard granularity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Link: https://lore.kernel.org/r/20220415045258.199825-26-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
10f0d2a5 |
|
14-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
block: add a bdev_nonrot helper Add a helper to check the nonrot flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
077d0c2c |
|
08-Mar-2022 |
Ojaswin Mujoo <ojaswin@linux.ibm.com> |
ext4: make mb_optimize_scan performance mount option work with extents Currently mb_optimize_scan scan feature which improves filesystem performance heavily (when FS is fragmented), seems to be not working with files with extents (ext4 by default has files with extents). This patch fixes that and makes mb_optimize_scan feature work for files with extents. Below are some performance numbers obtained when allocating a 10M and 100M file with and w/o this patch on a filesytem with no 1M contiguous block. <perf numbers> =============== Workload: dd if=/dev/urandom of=test conv=fsync bs=1M count=10/100 Time taken ===================================================== no. Size without-patch with-patch Diff(%) 1 10M 0m8.401s 0m5.623s 33.06% 2 100M 1m40.465s 1m14.737s 25.6% <debug stats> ============= w/o patch: mballoc: reqs: 17056 success: 11407 groups_scanned: 13643 cr0_stats: hits: 37 groups_considered: 9472 useless_loops: 36 bad_suggestions: 0 cr1_stats: hits: 11418 groups_considered: 908560 useless_loops: 1894 bad_suggestions: 0 cr2_stats: hits: 1873 groups_considered: 6913 useless_loops: 21 cr3_stats: hits: 21 groups_considered: 5040 useless_loops: 21 extents_scanned: 417364 goal_hits: 3707 2^n_hits: 37 breaks: 1873 lost: 0 buddies_generated: 239/240 buddies_time_used: 651080 preallocated: 705 discarded: 478 with patch: mballoc: reqs: 12768 success: 11305 groups_scanned: 12768 cr0_stats: hits: 1 groups_considered: 18 useless_loops: 0 bad_suggestions: 0 cr1_stats: hits: 5829 groups_considered: 50626 useless_loops: 0 bad_suggestions: 0 cr2_stats: hits: 6938 groups_considered: 580363 useless_loops: 0 cr3_stats: hits: 0 groups_considered: 0 useless_loops: 0 extents_scanned: 309059 goal_hits: 0 2^n_hits: 1 breaks: 1463 lost: 0 buddies_generated: 239/240 buddies_time_used: 791392 preallocated: 673 discarded: 446 Fixes: 196e402 (ext4: improve cr 0 / cr 1 group scanning) Cc: stable@kernel.org Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com> Suggested-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/fc9a48f7f8dcfc83891a8b21f6dd8cdf056ed810.1646732698.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
fd9b6fad |
|
01-Mar-2022 |
Yang Li <yang.lee@linux.alibaba.com> |
ext4: fix ext4_mb_clear_bb() kernel-doc comment Remove the excess description of @bh in ext4_mb_clear_bb() kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/ext4/mballoc.c:5895: warning: Excess function parameter 'bh' description in 'ext4_mb_clear_bb' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220301092136.34764-1-yang.lee@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
8c91c579 |
|
15-Feb-2022 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: add extra check in ext4_mb_mark_bb() to prevent against possible corruption This patch adds an extra checks in ext4_mb_mark_bb() function to make sure we mark & report error if we were to mark/clear any of the critical FS metadata specific bitmaps (&bail out) to prevent from any accidental corruption. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/53cbb6f2573db162a57f935365050d8b1df202ee.1644992610.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a00b482b |
|
15-Feb-2022 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: add strict range checks while freeing blocks Currently ext4_mb_clear_bb() & ext4_group_add_blocks() only checks whether the given block ranges (which is to be freed) belongs to any FS metadata blocks or not, of the block's respective block group. But to detect any FS error early, it is better to add more strict checkings in those functions which checks whether the given blocks belongs to any critical FS metadata or not within system-zone. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/ddd9143d064774e32d6364a99667817c6e8bfdc0.1644992610.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bd8247ee |
|
15-Feb-2022 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: no need to test for block bitmap bits in ext4_mb_mark_bb() We don't need the return value of mb_test_and_clear_bits() in ext4_mb_mark_bb() So simply use mb_clear_bits() instead. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/a971935306dafb124da0193c7dad1aa485210b62.1644992610.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
123e3016 |
|
15-Feb-2022 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: rename ext4_set_bits to mb_set_bits ext4_set_bits() should actually be mb_set_bits() for uniform API naming convention. This is via below cmd - grep -nr "ext4_set_bits" fs/ext4/ | cut -d ":" -f 1 | xargs sed -i 's/ext4_set_bits/mb_set_bits/g' Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/f1f6ece1405b76a7a987e9145d1adfaf71e30695.1644992610.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
8ac3939d |
|
15-Feb-2022 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: refactor ext4_free_blocks() to pull out ext4_mb_clear_bb() ext4_free_blocks() function became too long and confusing, this patch just pulls out the ext4_mb_clear_bb() function logic from it which clears the block bitmap and frees it. No functionality change in this patch Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/22c30fbb26ba409cf8aa5f0c7912970272c459e8.1644992610.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bfdc502a |
|
15-Feb-2022 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: fix ext4_mb_mark_bb() with flex_bg with fast_commit In case of flex_bg feature (which is by default enabled), extents for any given inode might span across blocks from two different block group. ext4_mb_mark_bb() only reads the buffer_head of block bitmap once for the starting block group, but it fails to read it again when the extent length boundary overflows to another block group. Then in this below loop it accesses memory beyond the block group bitmap buffer_head and results into a data abort. for (i = 0; i < clen; i++) if (!mb_test_bit(blkoff + i, bitmap_bh->b_data) == !state) already++; This patch adds this functionality for checking block group boundary in ext4_mb_mark_bb() and update the buffer_head(bitmap_bh) for every different block group. w/o this patch, I was easily able to hit a data access abort using Power platform. <...> [ 74.327662] EXT4-fs error (device loop3): ext4_mb_generate_buddy:1141: group 11, block bitmap and bg descriptor inconsistent: 21248 vs 23294 free clusters [ 74.533214] EXT4-fs (loop3): shut down requested (2) [ 74.536705] Aborting journal on device loop3-8. [ 74.702705] BUG: Unable to handle kernel data access on read at 0xc00000005e980000 [ 74.703727] Faulting instruction address: 0xc0000000007bffb8 cpu 0xd: Vector: 300 (Data Access) at [c000000015db7060] pc: c0000000007bffb8: ext4_mb_mark_bb+0x198/0x5a0 lr: c0000000007bfeec: ext4_mb_mark_bb+0xcc/0x5a0 sp: c000000015db7300 msr: 800000000280b033 dar: c00000005e980000 dsisr: 40000000 current = 0xc000000027af6880 paca = 0xc00000003ffd5200 irqmask: 0x03 irq_happened: 0x01 pid = 5167, comm = mount <...> enter ? for help [c000000015db7380] c000000000782708 ext4_ext_clear_bb+0x378/0x410 [c000000015db7400] c000000000813f14 ext4_fc_replay+0x1794/0x2000 [c000000015db7580] c000000000833f7c do_one_pass+0xe9c/0x12a0 [c000000015db7710] c000000000834504 jbd2_journal_recover+0x184/0x2d0 [c000000015db77c0] c000000000841398 jbd2_journal_load+0x188/0x4a0 [c000000015db7880] c000000000804de8 ext4_fill_super+0x2638/0x3e10 [c000000015db7a40] c0000000005f8404 get_tree_bdev+0x2b4/0x350 [c000000015db7ae0] c0000000007ef058 ext4_get_tree+0x28/0x40 [c000000015db7b00] c0000000005f6344 vfs_get_tree+0x44/0x100 [c000000015db7b70] c00000000063c408 path_mount+0xdd8/0xe70 [c000000015db7c40] c00000000063c8f0 sys_mount+0x450/0x550 [c000000015db7d50] c000000000035770 system_call_exception+0x4a0/0x4e0 [c000000015db7e10] c00000000000c74c system_call_common+0xec/0x250 Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/2609bc8f66fc15870616ee416a18a3d392a209c4.1644992609.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a5c0e2fd |
|
15-Feb-2022 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: correct cluster len and clusters changed accounting in ext4_mb_mark_bb ext4_mb_mark_bb() currently wrongly calculates cluster len (clen) and flex_group->free_clusters. This patch fixes that. Identified based on code review of ext4_mb_mark_bb() function. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/a0b035d536bafa88110b74456853774b64c8ac40.1644992609.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
31a074a0 |
|
09-Jan-2022 |
Xin Yin <yinxin.x@bytedance.com> |
ext4: modify the logic of ext4_mb_new_blocks_simple For now in ext4_mb_new_blocks_simple, if we found a block which should be excluded then will switch to next group, this may probably cause 'group' run out of range. Change to check next block in the same group when get a block should be excluded. Also change the search range to EXT4_CLUSTERS_PER_GROUP and add error checking. Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20220110035141.1980-3-yinxin.x@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
359745d7 |
|
21-Jan-2022 |
Muchun Song <songmuchun@bytedance.com> |
proc: remove PDE_DATA() completely Remove PDE_DATA() completely and replace it with pde_data(). [akpm@linux-foundation.org: fix naming clash in drivers/nubus/proc.c] [akpm@linux-foundation.org: now fix it properly] Link: https://lkml.kernel.org/r/20211124081956.87711-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2327fb2e |
|
03-Nov-2021 |
Lukas Czerner <lczerner@redhat.com> |
ext4: change s_last_trim_minblks type to unsigned long There is no good reason for the s_last_trim_minblks to be atomic. There is no data integrity needed and there is no real danger in setting and reading it in a racy manner. Change it to be unsigned long, the same type as s_clusters_per_group which is the maximum that's allowed. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Suggested-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20211103145122.17338-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
173b6e38 |
|
12-Nov-2021 |
Jan Kara <jack@suse.cz> |
ext4: avoid trim error on fs with small groups A user reported FITRIM ioctl failing for him on ext4 on some devices without apparent reason. After some debugging we've found out that these devices (being LVM volumes) report rather large discard granularity of 42MB and the filesystem had 1k blocksize and thus group size of 8MB. Because ext4 FITRIM implementation puts discard granularity into minlen, ext4_trim_fs() declared the trim request as invalid. However just silently doing nothing seems to be a more appropriate reaction to such combination of parameters since user did not specify anything wrong. CC: Lukas Czerner <lczerner@redhat.com> Fixes: 5c2ed62fd447 ("ext4: Adjust minlen with discard_granularity in the FITRIM ioctl") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211112152202.26614-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
8c80fb31 |
|
22-Nov-2021 |
Chunguang Xu <brookxu@tencent.com> |
ext4: fix a possible ABBA deadlock due to busy PA We found on older kernel (3.10) that in the scenario of insufficient disk space, system may trigger an ABBA deadlock problem, it seems that this problem still exists in latest kernel, try to fix it here. The main process triggered by this problem is that task A occupies the PA and waits for the jbd2 transaction finish, the jbd2 transaction waits for the completion of task B's IO (plug_list), but task B waits for the release of PA by task A to finish discard, which indirectly forms an ABBA deadlock. The related calltrace is as follows: Task A vfs_write ext4_mb_new_blocks() ext4_mb_mark_diskspace_used() JBD2 jbd2_journal_get_write_access() -> jbd2_journal_commit_transaction() ->schedule() filemap_fdatawait() | | | Task B | | do_unlinkat() | | ext4_evict_inode() | | jbd2_journal_begin_ordered_truncate() | | filemap_fdatawrite_range() | | ext4_mb_new_blocks() | -ext4_mb_discard_group_preallocations() <----- Here, try to cancel ext4_mb_discard_group_preallocations() internal retry due to PA busy, and do a limited number of retries inside ext4_mb_discard_preallocations(), which can circumvent the above problems, but also has some advantages: 1. Since the PA is in a busy state, if other groups have free PAs, keeping the current PA may help to reduce fragmentation. 2. Continue to traverse forward instead of waiting for the current group PA to be released. In most scenarios, the PA discard time can be reduced. However, in the case of smaller free space, if only a few groups have space, then due to multiple traversals of the group, it may increase CPU overhead. But in contrast, I feel that the overall benefit is better than the cost. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1637630277-23496-1-git-send-email-brookxu.cn@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
afcc4e32 |
|
20-Aug-2021 |
Lukas Bulwahn <lukas.bulwahn@gmail.com> |
ext4: scope ret locally in ext4_try_to_trim_range() As commit 6920b3913235 ("ext4: add new helper interface ext4_try_to_trim_range()") moves some code into the separate function ext4_try_to_trim_range(), the use of the variable ret within that function is more limited and can be adjusted as well. Scope the use of the variable ret locally and drop dead assignments. No functional change. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20210820120853.23134-1-lukas.bulwahn@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
188c299e |
|
16-Aug-2021 |
Jan Kara <jack@suse.cz> |
ext4: Support for checksumming from journal triggers JBD2 layer support triggers which are called when journaling layer moves buffer to a certain state. We can use the frozen trigger, which gets called when buffer data is frozen and about to be written out to the journal, to compute block checksums for some buffer types (similarly as does ocfs2). This avoids unnecessary repeated recomputation of the checksum (at the cost of larger window where memory corruption won't be caught by checksumming) and is even necessary when there are unsynchronized updaters of the checksummed data. So add superblock and journal trigger type arguments to ext4_journal_get_write_access() and ext4_journal_get_create_access() so that frozen triggers can be set accordingly. Also add inode argument to ext4_walk_page_buffers() and all the callbacks used with that function for the same purpose. This patch is mostly only a change of prototype of the above mentioned functions and a few small helpers. Real checksumming will come later. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a5fda113 |
|
14-Aug-2021 |
Theodore Ts'o <tytso@mit.edu> |
ext4: fix sparse warnings Add sparse annotations to suppress false positive context imbalance warnings, and use NULL instead of 0 in EXT_MAX_{EXTENT,INDEX}. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
5036ab8d |
|
30-Aug-2021 |
Wang Jianchao <wangjianchao@kuaishou.com> |
ext4: flush background discard kwork when retry allocation The background discard kwork tries to mark blocks used and issue discard. This can make filesystem suffer from NOSPC error, xfstest generic/371 can fail due to it. Fix it by flushing discard kwork in ext4_should_retry_alloc. At the same time, give up discard at the moment. Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Link: https://lore.kernel.org/r/20210830075246.12516-6-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
55cdd0af |
|
24-Jul-2021 |
Wang Jianchao <wangjianchao@kuaishou.com> |
ext4: get discard out of jbd2 commit kthread contex Right now, discard is issued and waited to be completed in jbd2 commit kthread context after the logs are committed. When large amount of files are deleted and discard is flooding, jbd2 commit kthread can be blocked for long time. Then all of the metadata operations can be blocked to wait the log space. One case is the page fault path with read mm->mmap_sem held, which wants to update the file time but has to wait for the log space. When other threads in the task wants to do mmap, then write mmap_sem is blocked. Finally all of the following read mmap_sem requirements are blocked, even the ps command which need to read the /proc/pid/ -cmdline. Our monitor service which needs to read /proc/pid/cmdline used to be blocked for 5 mins. This patch frees the blocks back to buddy after commit and then do discard in a async kworker context in fstrim fashion, namely, - mark blocks to be discarded as used if they have not been allocated - do discard - mark them free After this, jbd2 commit kthread won't be blocked any more by discard and we won't get NOSPC even if the discard is slow or throttled. Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2 Suggested-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Link: https://lore.kernel.org/r/20210830075246.12516-5-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b6f5558c |
|
24-Jul-2021 |
Wang Jianchao <wangjianchao@kuaishou.com> |
ext4: remove the repeated comment of ext4_trim_all_free Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210724074124.25731-4-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
6920b391 |
|
24-Jul-2021 |
Wang Jianchao <wangjianchao@kuaishou.com> |
ext4: add new helper interface ext4_try_to_trim_range() There is no functional change in this patch but just split the codes, which serachs free block and does trim, into a new function ext4_try_to_trim_range. This is preparing for the following async backgroup discard. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210724074124.25731-3-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bd2eea8d |
|
24-Jul-2021 |
Wang Jianchao <wangjianchao@kuaishou.com> |
ext4: remove the 'group' parameter of ext4_trim_extent Get rid of the 'group' parameter of ext4_trim_extent as we can get it from the 'e4b'. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210724074124.25731-2-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
cd84bbba |
|
23-Jun-2021 |
Stephen Brennan <stephen.s.brennan@oracle.com> |
ext4: use ext4_grp_locked_error in mb_find_extent Commit 5d1b1b3f492f ("ext4: fix BUG when calling ext4_error with locked block group") introduces ext4_grp_locked_error to handle unlocking a group in error cases. Otherwise, there is a possibility of a sleep while atomic. However, since 43c73221b3b1 ("ext4: replace BUG_ON with WARN_ON in mb_find_extent()"), mb_find_extent() has contained a ext4_error() call while a group spinlock is held. Replace this with ext4_grp_locked_error. Fixes: 43c73221b3b1 ("ext4: replace BUG_ON with WARN_ON in mb_find_extent()") Cc: <stable@vger.kernel.org> # 4.14+ Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Link: https://lore.kernel.org/r/20210623232114.34457-1-stephen.s.brennan@oracle.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a8867f4e |
|
12-Apr-2021 |
Phillip Potter <phil@philpotter.co.uk> |
ext4: fix memory leak in ext4_mb_init_backend on error path. Fix a memory leak discovered by syzbot when a file system is corrupted with an illegally large s_log_groups_per_flex. Reported-by: syzbot+aa12d6106ea4ca1b6aae@syzkaller.appspotmail.com Signed-off-by: Phillip Potter <phil@philpotter.co.uk> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20210412073837.1686-1-phil@philpotter.co.uk Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
666245d9 |
|
08-Apr-2021 |
Jack Qiu <jack.qiu@huawei.com> |
ext4: fix trailing whitespace Made suggested modifications from checkpatch in reference to ERROR: trailing whitespace Signed-off-by: Jack Qiu <jack.qiu@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20210409042035.15516-1-jack.qiu@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
f68f4063 |
|
01-Apr-2021 |
Harshad Shirwadkar <harshadshirwadkar@gmail.com> |
ext4: add proc files to monitor new structures This patch adds a new file "mb_structs_summary" which allows us to see the summary of the new allocator structures added in this series. Here's the sample output of file: optimize_scan: 1 max_free_order_lists: list_order_0_groups: 0 list_order_1_groups: 0 list_order_2_groups: 0 list_order_3_groups: 0 list_order_4_groups: 0 list_order_5_groups: 0 list_order_6_groups: 0 list_order_7_groups: 0 list_order_8_groups: 0 list_order_9_groups: 0 list_order_10_groups: 0 list_order_11_groups: 0 list_order_12_groups: 0 list_order_13_groups: 40 fragment_size_tree: tree_min: 16384 tree_max: 32768 tree_nodes: 40 Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20210401172129.189766-7-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
196e402a |
|
01-Apr-2021 |
Harshad Shirwadkar <harshadshirwadkar@gmail.com> |
ext4: improve cr 0 / cr 1 group scanning Instead of traversing through groups linearly, scan groups in specific orders at cr 0 and cr 1. At cr 0, we want to find groups that have the largest free order >= the order of the request. So, with this patch, we maintain lists for each possible order and insert each group into a list based on the largest free order in its buddy bitmap. During cr 0 allocation, we traverse these lists in the increasing order of largest free orders. This allows us to find a group with the best available cr 0 match in constant time. If nothing can be found, we fallback to cr 1 immediately. At CR1, the story is slightly different. We want to traverse in the order of increasing average fragment size. For CR1, we maintain a rb tree of groupinfos which is sorted by average fragment size. Instead of traversing linearly, at CR1, we traverse in the order of increasing average fragment size, starting at the most optimal group. This brings down cr 1 search complexity to log(num groups). For cr >= 2, we just perform the linear search as before. Also, in case of lock contention, we intermittently fallback to linear search even in CR 0 and CR 1 cases. This allows us to proceed during the allocation path even in case of high contention. There is an opportunity to do optimization at CR2 too. That's because at CR2 we only consider groups where bb_free counter (number of free blocks) is greater than the request extent size. That's left as future work. All the changes introduced in this patch are protected under a new mount option "mb_optimize_scan". With this patchset, following experiment was performed: Created a highly fragmented disk of size 65TB. The disk had no contiguous 2M regions. Following command was run consecutively for 3 times: time dd if=/dev/urandom of=file bs=2M count=10 Here are the results with and without cr 0/1 optimizations introduced in this patch: |---------+------------------------------+---------------------------| | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations | |---------+------------------------------+---------------------------| | 1st run | 5m1.871s | 2m47.642s | | 2nd run | 2m28.390s | 0m0.611s | | 3rd run | 2m26.530s | 0m1.255s | |---------+------------------------------+---------------------------| Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4b68f6df |
|
01-Apr-2021 |
Harshad Shirwadkar <harshadshirwadkar@gmail.com> |
ext4: add MB_NUM_ORDERS macro A few arrays in mballoc.c use the total number of valid orders as their size. Currently, this value is set as "sb->s_blocksize_bits + 2". This makes code harder to read. So, instead add a new macro MB_NUM_ORDERS(sb) to make the code more readable. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20210401172129.189766-5-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a6c75eaf |
|
01-Apr-2021 |
Harshad Shirwadkar <harshadshirwadkar@gmail.com> |
ext4: add mballoc stats proc file Add new stats for measuring the performance of mballoc. This patch is forked from Artem Blagodarenko's work that can be found here: https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch This patch reorganizes the stats by cr level. This is how the output looks like: mballoc: reqs: 0 success: 0 groups_scanned: 0 cr0_stats: hits: 0 groups_considered: 0 useless_loops: 0 bad_suggestions: 0 cr1_stats: hits: 0 groups_considered: 0 useless_loops: 0 bad_suggestions: 0 cr2_stats: hits: 0 groups_considered: 0 useless_loops: 0 cr3_stats: hits: 0 groups_considered: 0 useless_loops: 0 extents_scanned: 0 goal_hits: 0 2^n_hits: 0 breaks: 0 lost: 0 buddies_generated: 0/40 buddies_time_used: 0 preallocated: 0 discarded: 0 Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20210401172129.189766-4-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
67d25186 |
|
01-Apr-2021 |
Harshad Shirwadkar <harshadshirwadkar@gmail.com> |
ext4: drop s_mb_bal_lock and convert protected fields to atomic s_mb_buddies_generated gets used later in this patch series to determine if the cr 0 and cr 1 optimziations should be performed or not. Currently, s_mb_buddies_generated is protected under a spin_lock. In the allocation path, it is better if we don't depend on the lock and instead read the value atomically. In order to do that, we drop s_bal_lock altogether and we convert the only two protected fields by it s_mb_buddies_generated and s_mb_generation_time to atomic type. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20210401172129.189766-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
f91436d5 |
|
24-Feb-2021 |
Sabyrzhan Tasbolatov <snovitoll@gmail.com> |
fs/ext4: fix integer overflow in s_log_groups_per_flex syzbot found UBSAN: shift-out-of-bounds in ext4_mb_init [1], when 1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX, where sbi->s_mb_prefetch is unsigned integer type. 32 is the maximum allowed power of s_log_groups_per_flex. Following if check will also trigger UBSAN shift-out-of-bound: if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) { So I'm checking it against the raw number, perhaps there is another way to calculate UINT_MAX max power. Also use min_t as to make sure it's uint type. [1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24 shift exponent 60 is too large for 32-bit type 'int' Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x137/0x1be lib/dump_stack.c:120 ubsan_epilogue lib/ubsan.c:148 [inline] __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395 ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline] ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898 ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983 Reported-by: syzbot+a8b4b0c60155e87e9484@syzkaller.appspotmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210224095800.3350002-1-snovitoll@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
82ef1370 |
|
03-Dec-2020 |
Chunguang Xu <brookxu@tencent.com> |
ext4: avoid s_mb_prefetch to be zero in individual scenarios Commit cfd732377221 ("ext4: add prefetching for block allocation bitmaps") introduced block bitmap prefetch, and expects to read block bitmaps of flex_bg through an IO. However, it seems to ignore the value range of s_log_groups_per_flex. In the scenario where the value of s_log_groups_per_flex is greater than 27, s_mb_prefetch or s_mb_prefetch_limit will overflow, cause a divide zero exception. In addition, the logic of calculating nr is also flawed, because the size of flexbg is fixed during a single mount, but s_mb_prefetch can be modified, which causes nr to fail to meet the value condition of [1, flexbg_size]. To solve this problem, we need to set the upper limit of s_mb_prefetch. Since we expect to load block bitmaps of a flex_bg through an IO, we can consider determining a reasonable upper limit among the IO limit parameters. After consideration, we chose BLK_MAX_SEGMENT_SIZE. This is a good choice to solve divide zero problem and avoiding performance degradation. [ Some minor code simplifications to make the changes easy to follow -- TYT ] Reported-by: Tosk Robot <tencent_os_robot@tencent.com> Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Samuel Liao <samuelliao@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1607051143-24508-1-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
cca41553 |
|
07-Nov-2020 |
Chunguang Xu <brookxu@tencent.com> |
ext4: fix a memory leak of ext4_free_data When freeing metadata, we will create an ext4_free_data and insert it into the pending free list. After the current transaction is committed, the object will be freed. ext4_mb_free_metadata() will check whether the area to be freed overlaps with the pending free list. If true, return directly. At this time, ext4_free_data is leaked. Fortunately, the probability of this problem is small, since it only occurs if the file system is corrupted such that a block is claimed by more one inode and those inodes are deleted within a single jbd2 transaction. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/1604764698-4269-8-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
ce3cca33 |
|
07-Nov-2020 |
Chunguang Xu <brookxu@tencent.com> |
ext4: simplify the code of mb_find_order_for_block The code of mb_find_order_for_block is a bit obscure, but we can simplify it with mb_find_buddy(), make the code more concise. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1604764698-4269-3-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
6bd97bf2 |
|
07-Nov-2020 |
Chunguang Xu <brookxu@tencent.com> |
ext4: remove redundant mb_regenerate_buddy() After this patch (163a203), if an abnormal bitmap is detected, we will mark the group as corrupt, and we will not use this group in the future. Therefore, it should be meaningless to regenerate the buddy bitmap of this group, It might be better to delete it. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1604764698-4269-2-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
9b5f6c9b |
|
05-Nov-2020 |
Harshad Shirwadkar <harshadshirwadkar@gmail.com> |
ext4: make s_mount_flags modifications atomic Fast commit file system states are recorded in sbi->s_mount_flags. Fast commit expects these bit manipulations to be atomic. This patch adds helpers to make those modifications atomic. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-21-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
e121bd48 |
|
30-Oct-2020 |
Dan Carpenter <dan.carpenter@oracle.com> |
ext4: silence an uninitialized variable warning Smatch complains that "i" can be uninitialized if we don't enter the loop. I don't know if it's possible but we may as well silence this warning. [ Initialize i to sb->s_blocksize instead of 0. The only way the for loop could be skipped entirely is the in-memory data structures, in particular the bh->b_data for the on-disk superblock has gotten corrupted enough that calculated value of group is >= to ext4_get_groups_count(sb). In that case, we want to exit immediately without allocating a block. -- TYT ] Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20201030114620.GB3251003@mwanda Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
8016e29f |
|
15-Oct-2020 |
Harshad Shirwadkar <harshadshirwadkar@gmail.com> |
ext4: fast commit recovery path This patch adds fast commit recovery path support for Ext4 file system. We add several helper functions that are similar in spirit to e2fsprogs journal recovery path handlers. Example of such functions include - a simple block allocator, idempotent block bitmap update function etc. Using these routines and the fast commit log in the fast commit area, the recovery path (ext4_fc_replay()) performs fast commit log recovery. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
addd752c |
|
28-Sep-2020 |
Chunguang Xu <brookxu@tencent.com> |
ext4: make mb_check_counter per group Make bb_check_counter per group, so each group has the same chance to be checked, which can expose errors more easily. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/1601292995-32205-2-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
9d1f9b27 |
|
28-Sep-2020 |
Chunguang Xu <brookxu@tencent.com> |
ext4: delete invalid comments near mb_buddy_adjust_border The comment near mb_buddy_adjust_border seems meaningless, just clear it. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/1601292995-32205-1-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b483bb77 |
|
04-Aug-2020 |
Randy Dunlap <rdunlap@infradead.org> |
ext4: delete duplicated words + other fixes Delete repeated words in fs/ext4/. {the, this, of, we, after} Also change spelling of "xttr" in inline.c to "xattr" in 2 places. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
5b3dc19d |
|
24-Sep-2020 |
Jan Kara <jack@suse.cz> |
ext4: discard preallocations before releasing group lock ext4_mb_discard_group_preallocations() can be releasing group lock with preallocations accumulated on its local list. Thus although discard_pa_seq was incremented and concurrent allocating processes will be retrying allocations, it can happen that premature ENOSPC error is returned because blocks used for preallocations are not available for reuse yet. Make sure we always free locally accumulated preallocations before releasing group lock. Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200924150959.4335-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
70022da8 |
|
16-Sep-2020 |
Ye Bin <yebin10@huawei.com> |
ext4: fix dead loop in ext4_mb_new_blocks As we test disk offline/online with running fsstress, we find fsstress process is keeping running state. kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114 .... kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114 ext4_mb_new_blocks repeat: ext4_mb_discard_preallocations_should_retry(sb, ac, &seq) freed = ext4_mb_discard_preallocations ext4_mb_discard_group_preallocations this_cpu_inc(discard_pa_seq); ---> freed == 0 seq_retry = ext4_get_discard_pa_seq_sum for_each_possible_cpu(__cpu) __seq += per_cpu(discard_pa_seq, __cpu); if (seq_retry != *seq) { *seq = seq_retry; ret = true; } As we see seq_retry is sum of discard_pa_seq every cpu, if ext4_mb_discard_group_preallocations return zero discard_pa_seq in this cpu maybe increase one, so condition "seq_retry != *seq" have always been met. Ritesh Harjani suggest to in ext4_mb_discard_group_preallocations function we only increase discard_pa_seq when there is some PA to free. Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200916113859.1556397-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
27bc446e |
|
17-Aug-2020 |
brookxu <brookxu.cn@gmail.com> |
ext4: limit the length of per-inode prealloc list In the scenario of writing sparse files, the per-inode prealloc list may be very long, resulting in high overhead for ext4_mb_use_preallocated(). To circumvent this problem, we limit the maximum length of per-inode prealloc list to 512 and allow users to modify it. After patching, we observed that the sys ratio of cpu has dropped, and the system throughput has increased significantly. We created a process to write the sparse file, and the running time of the process on the fixed kernel was significantly reduced, as follows: Running time on unfixed kernel: [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat real 0m2.051s user 0m0.008s sys 0m2.026s Running time on fixed kernel: [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat real 0m0.471s user 0m0.004s sys 0m0.395s Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
66d5e027 |
|
17-Aug-2020 |
brookxu <brookxu.cn@gmail.com> |
ext4: reorganize if statement of ext4_mb_release_context() Reorganize the if statement of ext4_mb_release_context(), make it easier to read. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/5439ac6f-db79-ad68-76c1-a4dda9aa0cc3@gmail.com Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
c55ee7d2 |
|
14-Aug-2020 |
brookxu <brookxu.cn@gmail.com> |
ext4: add mb_debug logging when there are lost chunks Lost chunks are when some other process raced with the current thread to grab a particular block allocation. Add mb_debug log for developers who wants to see how often this is happening for a particular workload. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/0a165ac0-1912-aebd-8a0d-b42e7cd1aea1@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
e0d438c7 |
|
09-Aug-2020 |
Xu Wang <vulab@iscas.ac.cn> |
mballoc: replace seq_printf with seq_puts seq_puts is a lot cheaper than seq_printf, so use that to print literal strings. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200810022158.9167-1-vulab@iscas.ac.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
dddcd2f9 |
|
07-Aug-2020 |
brookxu <brookxu.cn@gmail.com> |
ext4: optimize the implementation of ext4_mb_good_group() It might be better to adjust the code in two places: 1. Determine whether grp is currupt or not should be placed first. 2. (cr<=2 && free <ac->ac_g_ex.fe_len)should may belong to the crx strategy, and it may be more appropriate to put it in the subsequent switch statement block. For cr1, cr2, the conditions in switch potentially realize the above judgment. For cr0, we should add (free <ac->ac_g_ex.fe_len) judgment, and then delete (free / fragments) >= ac->ac_g_ex.fe_len), because cr0 returns true by default. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/e20b2d8f-1154-adb7-3831-a9e11ba842e9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
051e2ce8 |
|
07-Aug-2020 |
brookxu <brookxu.cn@gmail.com> |
ext4: delete invalid comments near ext4_mb_check_limits() These comments do not seem to be related to ext4_mb_check_limits(), it may be invalid. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/c49faf0c-d5d5-9c51-6911-9e0ff57c6bfa@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
e9a3cd48 |
|
07-Aug-2020 |
brookxu <brookxu.cn@gmail.com> |
ext4: fix typos in ext4_mb_regular_allocator() comment Fix typos in ext4_mb_regular_allocator() comment Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/d6514145-73b3-808b-ec5a-a8be27c51f9c@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
ce9f24cc |
|
28-Jul-2020 |
Jan Kara <jack@suse.cz> |
ext4: check journal inode extents more carefully Currently, system zones just track ranges of block, that are "important" fs metadata (bitmaps, group descriptors, journal blocks, etc.). This however complicates how extent tree (or indirect blocks) can be checked for inodes that actually track such metadata - currently the journal inode but arguably we should be treating quota files or resize inode similarly. We cannot run __ext4_ext_check() on such metadata inodes when loading their extents as that would immediately trigger the validity checks and so we just hack around that and special-case the journal inode. This however leads to a situation that a journal inode which has extent tree of depth at least one can have invalid extent tree that gets unnoticed until ext4_cache_extents() crashes. To overcome this limitation, track inode number each system zone belongs to (0 is used for zones not belonging to any inode). We can then verify inode number matches the expected one when verifying extent tree and thus avoid the false errors. With this there's no need to to special-case journal inode during extent tree checking anymore so remove it. Fixes: 0a944e8a6c66 ("ext4: don't perform block validity checks on the journal inode") Reported-by: Wolfgang Frisch <wolfgang.frisch@suse.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200728130437.7804-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
9375ac77 |
|
26-Jul-2020 |
brookxu <brookxu.cn@gmail.com> |
ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp() Delete the invalid BUGON in ext4_mb_load_buddy_gfp(), the previous code has already judged whether page is NULL. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/ad68e8a2-5ec3-5beb-537f-f3e53f55367a@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
3d392b26 |
|
16-Jul-2020 |
Theodore Ts'o <tytso@mit.edu> |
ext4: add prefetch_block_bitmaps mount option For file systems where we can afford to keep the buddy bitmaps cached, we can speed up initial writes to large file systems by starting to load the block allocation bitmaps as soon as the file system is mounted. This won't work well for _super_ large file systems, or memory constrained systems, so we only enable this when it is requested via a mount option. Addresses-Google-Bug: 159488342 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
#
c1d2c7d4 |
|
19-Jun-2020 |
Alex Zhuravlev <azhuravlev@whamcloud.com> |
ext4: skip non-loaded groups at cr=0/1 when scanning for good groups cr=0 is supposed to be an optimization to save CPU cycles, but if buddy data (in memory) is not initialized then all this makes no sense as we have to do sync IO taking a lot of cycles. Also, at cr=0 mballoc doesn't choose any available chunk. cr=1 also skips groups using heuristic based on avg. fragment size. It's more useful to skip such groups and switch to cr=2 where groups will be scanned for available chunks. However, we always read the first block group in a flex_bg so metadata blocks will get read into the first flex_bg if possible. Using sparse image and dm-slow virtual device of 120TB was simulated, then the image was formatted and filled using debugfs to mark ~85% of available space as busy. mount process w/o the patch couldn't complete in half an hour (according to vmstat it would take ~10-11 hours). With the patch applied mount took ~20 seconds. Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988 Signed-off-by: Alex Zhuravlev <azhuravlev@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
|
#
cfd73237 |
|
21-Apr-2020 |
Alex Zhuravlev <bzzz@whamcloud.com> |
ext4: add prefetching for block allocation bitmaps This should significantly improve bitmap loading, especially for flex groups as it tries to load all bitmaps within a flex.group instead of one by one synchronously. Prefetching is done in 8 * flex_bg groups, so it should be 8 read-ahead reads for a single allocating thread. At the end of allocation the thread waits for read-ahead completion and initializes buddy information so that read-aheads are not lost in case of memory pressure. At cr=0 the number of prefetching IOs is limited per allocation context to prevent a situation when mballoc loads thousands of bitmaps looking for a perfect group and ignoring groups with good chunks. Together with the patch "ext4: limit scanning of uninitialized groups" the mount time (which includes few tiny allocations) of a 1PB filesystem is reduced significantly: 0% full 50%-full unpatched patched mount time 33s 9279s 563s [ Restructured by tytso; removed the state flags in the allocation context, so it can be used to lazily prefetch the allocation bitmaps immediately after the file system is mounted. Skip prefetching block groups which are uninitialized. Finally pass in the REQ_RAHEAD flag to the block layer while prefetching. ] Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
3cb77bd2 |
|
14-Jul-2020 |
brookxu <brookxu.cn@gmail.com> |
ext4: fix spelling typos in ext4_mb_initialize_context Fix spelling typos in ext4_mb_initialize_context. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/883b523c-58ec-7f38-0bb8-cd2ea4393684@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
81198536 |
|
09-Jun-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr Simplify reading a seq variable by directly using this_cpu_read API instead of doing this_cpu_ptr and then dereferencing it. This also avoid the below kernel BUG: which happens when CONFIG_DEBUG_PREEMPT is enabled BUG: using smp_processor_id() in preemptible [00000000] code: syz-fuzzer/6927 caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711 CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x18f/0x20d lib/dump_stack.c:118 check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48 ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711 ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244 ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626 ext4_getblk+0xad/0x520 fs/ext4/inode.c:833 ext4_bread+0x7c/0x380 fs/ext4/inode.c:883 ext4_append+0x153/0x360 fs/ext4/namei.c:67 ext4_init_new_dir fs/ext4/namei.c:2757 [inline] ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802 vfs_mkdir+0x419/0x690 fs/namei.c:3632 do_mkdirat+0x21e/0x280 fs/namei.c:3655 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 42f56b7a4a7d ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling") Suggested-by: Borislav Petkov <bp@alien8.de> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: syzbot+82f324bb69744c5f6969@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/534f275016296996f54ecf65168bb3392b6f653d.1591699601.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
99377830 |
|
19-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: use lock for checking free blocks while retrying Currently while doing block allocation grp->bb_free may be getting modified if discard is happening in parallel. For e.g. consider a case where there are lot of threads who have preallocated lot of blocks and there is a thread which is trying to discard all of this group's PA. Now it could happen that we see all of those group's bb_free is zero and fail the allocation while there is sufficient space if we free up all the PA. So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set if we are unable to allocate any blocks in the first try (since we may not have considered blocks about to be discarded from PA lists). So during retry attempt to allocate blocks we will use ext4_lock_group() for checking if the group is good or not. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
8ef123fe |
|
19-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: refactor ext4_mb_good_group() ext4_mb_good_group() definition was changed some time back and now it even initializes the buddy cache (via ext4_mb_init_group()), if in case the EXT4_MB_GRP_NEED_INIT() is true for a group. Note that ext4_mb_init_group() could sleep and so should not be called under a spinlock held. This is fine as of now because ext4_mb_good_group() is called before loading the buddy bitmap without ext4_lock_group() held and again called after loading the bitmap, only this time with ext4_lock_group() held. But still this whole thing is confusing. So this patch refactors out ext4_mb_good_group_nolock() which should be called when without holding ext4_lock_group(). Also in further patches we hold the spinlock (ext4_lock_group()) while doing any calculations which involves grp->bb_free or grp->bb_fragments. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
07b5b8e1 |
|
19-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling There could be a race in function ext4_mb_discard_group_preallocations() where the 1st thread may iterate through group's bb_prealloc_list and remove all the PAs and add to function's local list head. Now if the 2nd thread comes in to discard the group preallocations, it will see that the group->bb_prealloc_list is empty and will return 0. Consider for a case where we have less number of groups (for e.g. just group 0), this may even return an -ENOSPC error from ext4_mb_new_blocks() (where we call for ext4_mb_discard_group_preallocations()). But that is wrong, since 2nd thread should have waited for 1st thread to release all the PAs and should have retried for allocation. Since 1st thread was anyway going to discard the PAs. The algorithm using this percpu seq counter goes below: 1. We sample the percpu discard_pa_seq counter before trying for block allocation in ext4_mb_new_blocks(). 2. We increment this percpu discard_pa_seq counter when we either allocate or free these blocks i.e. while marking those blocks as used/free in mb_mark_used()/mb_free_blocks(). 3. We also increment this percpu seq counter when we successfully identify that the bb_prealloc_list is not empty and hence proceed for discarding of those PAs inside ext4_mb_discard_group_preallocations(). Now to make sure that the regular fast path of block allocation is not affected, as a small optimization we only sample the percpu seq counter on that cpu. Only when the block allocation fails and when freed blocks found were 0, that is when we sample percpu seq counter for all cpus using below function ext4_get_discard_pa_seq_sum(). This happens after making sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty. It can be well argued that why don't just check for grp->bb_free to see if there are any free blocks to be allocated. So here are the two concerns which were discussed:- 1. If for some reason the blocks available in the group are not appropriate for allocation logic (say for e.g. EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then the retry logic may result into infinte looping since grp->bb_free is non-zero. 2. Also before preallocation was clubbed with block allocation with the same ext4_lock_group() held, there were lot of races where grp->bb_free could not be reliably relied upon. Due to above, this patch considers discard_pa_seq logic to determine if we should retry for block allocation. Say if there are are n threads trying for block allocation and none of those could allocate or discard any of the blocks, then all of those n threads will fail the block allocation and return -ENOSPC error. (Since the seq counter for all of those will match as no block allocation/discard was done during that duration). Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
cf5e2ca6 |
|
19-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: refactor ext4_mb_discard_preallocations() Implement ext4_mb_discard_preallocations_should_retry() which we will need in later patches to add more logic like check for sequence number match to see if we should retry for block allocation or not. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
53f86b17 |
|
19-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: add blocks to PA list under same spinlock after allocating blocks ext4_mb_discard_preallocations() only checks for grp->bb_prealloc_list of every group to discard the group's PA to free up the space if allocation request fails. Consider below race:- Process A Process B 1. allocate blocks 1. Fails block allocation from ext4_mb_regular_allocator() ext4_lock_group() allocated blocks more than ac_o_ex.fe_len ext4_unlock_group() 2. Scans the grp->bb_prealloc_list (under ext4_lock_group()) and find nothing and thus return -ENOSPC. 2. Add the additional blocks to PA list ext4_lock_group() add blocks to grp->bb_prealloc_list ext4_unlock_group() Above race could be avoided if we add those additional blocks to grp->bb_prealloc_list at the same time with block allocation when ext4_lock_group() was still held. With this discard-PA will know if there are actually any blocks which could be freed from the PA Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
d3df1453 |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: make mb_debug() implementation to use pr_debug() mb_debug() msg had only 1 control level for all type of msgs. And if we enable mballoc_debug then all of those msgs would be enabled. Instead of adding multiple debug levels for mb_debug() msgs, use pr_debug() with which we could have finer control to print msgs at all of different levels (i.e. at file, func, line no.). Also add process name/pid, superblk id, and other info in mb_debug() msg. This also kills the mballoc_debug module parameter, since it is not needed any more. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
eb2b8ebb |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECK Make sure to check for e4b->bd_info->bb_bitmap == NULL, in mb_cmp_bitmaps() and return if NULL, to avoid possible NULL ptr dereference. Similar to how we do this in other ifdef DOUBLE_CHECK functions. Also remove the BUG_ON() logic if kmalloc() or ext4_read_block_bitmap() fails. We should simply mark grp->bb_bitmap as NULL if above happens. In fact ext4_read_block_bitmap() may even return an error in case of resize ioctl. Hence remove this BUG_ON logic (fstests ext4/032 may trigger this). Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a3450215 |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: refactor code inside DOUBLE_CHECK into separate function This patch implemets mb_group_bb_bitmap_alloc() and mb_group_bb_bitmap_free() function to remove #ifdef DOUBLE_CHECK macro and it's related code from inside ext4_mb_add_groupinfo()/ext4_mb_release(). There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8c2095d74b779f0254a19b24982490dc6f07c4f9.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4fca8f07 |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: make ext4_mb_use_preallocated() return type as bool Change return type of function ext4_mb_use_preallocated() to bool to better reflect what this function can return. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/7880cb6ef911465beafefcd7e9c3ea214688744b.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
f283529a |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: simplify error handling in ext4_init_mballoc() This patch simplifies error handling logic in ext4_init_mballoc(), by adding all the cleanups at one place at the end of that function. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8621a7bc68f7107a9ac4292afeb784515333bd25.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
004379d0 |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: fix few other format specifier in mb_debug() Fix few other format specifiers in mb_debug() msgs. As such no other functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/574fa7f833abf2dbf3b53a2fea3195e71f6cdbd8.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
36bad423 |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: correct the mb_debug() format specifier for pa_len var pa->pa_len is an integer. Fix all of the format specifier used in mb_debug() for pa_len to %d instead of %u. As such no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/af4987f643c586f62bcc9961e43f0a67151d5551.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
bbc4ec77 |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: add more mb_debug() msgs This patch adds some more debugging mb_debug() msgs to help improve mballoc code debugging. Other than adding more mb_debug() msgs at few more places, there should be no other functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/5fc8e7788b924e211fcfa4a4c1d2f8503511661a.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
e68cf40c |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: refactor ext4_mb_show_ac() This factors out ext4_mb_show_pa() function to show all the group's preallocation info. This could be useful info to be added in later patches. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8f07d890b0038dcc935e9c10e6043ec9f3792721.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
212da3ec |
|
10-May-2020 |
Ritesh Harjani <riteshh@linux.ibm.com> |
ext4: mballoc: print bb_free info even when it is 0 Improve the debugging msg by also printing even if bb_free is 0. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/c894f1d1d30f86ae38f4e3a861949665b6dc61cd.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
907ea529 |
|
13-Apr-2020 |
Theodore Ts'o <tytso@mit.edu> |
ext4: convert BUG_ON's to WARN_ON's in mballoc.c If the in-core buddy bitmap gets corrupted (or out of sync with the block bitmap), issue a WARN_ON and try to recover. In most cases this involves skipping trying to allocate out of a particular block group. We can end up declaring the file system corrupted, which is fair, since the file system probably should be checked before we proceed any further. Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu Google-Bug-Id: 34811296 Google-Bug-Id: 34639169 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
54d3adbc |
|
28-Mar-2020 |
Theodore Ts'o <tytso@mit.edu> |
ext4: save all error info in save_error_info() and drop ext4_set_errno() Using a separate function, ext4_set_errno() to set the errno is problematic because it doesn't do the right thing once s_last_error_errorcode is non-zero. It's also less racy to set all of the error information all at once. (Also, as a bonus, it shrinks code size slightly.) Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu Fixes: 878520ac45f9 ("ext4: save the error code which triggered...") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
eb576086 |
|
10-Mar-2020 |
Dmitry Monakhov <dmonakhov@gmail.com> |
ext4: mark block bitmap corrupted when found instead of BUGON We already has similar code in ext4_mb_complex_scan_group(), but ext4_mb_simple_scan_group() still affected. Other reports: https://www.spinics.net/lists/linux-ext4/msg60231.html Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Link: https://lore.kernel.org/r/20200310150156.641-1-dmonakhov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
92e9c58c |
|
13-Feb-2020 |
Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> |
ext4: use built-in RCU list checking in mballoc list_for_each_entry_rcu() has built-in RCU and lock checking. Pass cond argument to list_for_each_entry_rcu() to silence false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled by default. Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Link: https://lore.kernel.org/r/20200213152558.7070-1-madhuparnabhowmik10@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
7c990728 |
|
18-Feb-2020 |
Suraj Jitindar Singh <surajjs@amazon.com> |
ext4: fix potential race between s_flex_groups online resizing and access During an online resize an array of s_flex_groups structures gets replaced so it can get enlarged. If there is a concurrent access to the array and this memory has been reused then this can lead to an invalid memory access. The s_flex_group array has been converted into an array of pointers rather than an array of structures. This is to ensure that the information contained in the structures cannot get out of sync during a resize due to an accessor updating the value in the old structure after it has been copied but before the array pointer is updated. Since the structures them- selves are no longer copied but only the pointers to them this case is mitigated. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
df3da4ea |
|
18-Feb-2020 |
Suraj Jitindar Singh <surajjs@amazon.com> |
ext4: fix potential race between s_group_info online resizing and access During an online resize an array of pointers to s_group_info gets replaced so it can get enlarged. If there is a concurrent access to the array in ext4_get_group_info() and this memory has been reused then this can lead to an invalid memory access. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Balbir Singh <sblbir@amazon.com> Cc: stable@kernel.org
|
#
878520ac |
|
19-Nov-2019 |
Theodore Ts'o <tytso@mit.edu> |
ext4: save the error code which triggered an ext4_error() in the superblock This allows the cause of an ext4_error() report to be categorized based on whether it was triggered due to an I/O error, or an memory allocation error, or other possible causes. Most errors are caused by a detected file system inconsistency, so the default code stored in the superblock will be EXT4_ERR_EFSCORRUPTED. Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
c60990b3 |
|
19-Jun-2019 |
Theodore Ts'o <tytso@mit.edu> |
ext4: clean up kerneldoc warnigns when building with W=1 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4b99faa2 |
|
24-Apr-2019 |
Khazhismel Kumykov <khazhy@google.com> |
ext4: cond_resched in work-heavy group loops Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
#
31562b95 |
|
06-Apr-2019 |
Jan Kara <jack@suse.cz> |
ext4: make sanity check in mballoc more strict The sanity check in mb_find_extent() only checked that returned extent does not extend past blocksize * 8, however it should not extend past EXT4_CLUSTERS_PER_GROUP(sb). This can happen when clusters_per_group < blocksize * 8 and the tail of the bitmap is not properly filled by 1s which happened e.g. when ancient kernels have grown the filesystem. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
82dd124c |
|
10-Feb-2019 |
Nikolay Borisov <nborisov@suse.com> |
ext4: replace opencoded i_writecount usage with inode_is_open_for_write() There is a function which clearly conveys the objective of checking i_writecount. Additionally the usage in ext4_mb_initialize_context was wrong, since a node would have wrongfully been reported as writable if i_writecount had a negative value (MMAP_DENY_WRITE). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
9fe67149 |
|
01-Oct-2018 |
Eric Whitney <enwlinux@gmail.com> |
ext4: adjust reserved cluster count when removing extents Modify ext4_ext_remove_space() and the code it calls to correct the reserved cluster count for pending reservations (delayed allocated clusters shared with allocated blocks) when a block range is removed from the extent tree. Pending reservations may be found for the clusters at the ends of written or unwritten extents when a block range is removed. If a physical cluster at the end of an extent is freed, it's necessary to increment the reserved cluster count to maintain correct accounting if the corresponding logical cluster is shared with at least one delayed and unwritten extent as found in the extents status tree. Add a new function, ext4_rereserve_cluster(), to reapply a reservation on a delayed allocated cluster sharing blocks with a freed allocated cluster. To avoid ENOSPC on reservation, a flag is applied to ext4_free_blocks() to briefly defer updating the freeclusters counter when an allocated cluster is freed. This prevents another thread from allocating the freed block before the reservation can be reapplied. Redefine the partial cluster object as a struct to carry more state information and to clarify the code using it. Adjust the conditional code structure in ext4_ext_remove_space to reduce the indentation level in the main body of the code to improve readability. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
863c37fc |
|
04-Aug-2018 |
zhong jiang <zhongjiang@huawei.com> |
ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa() The err is not used after initalization. So just remove the variable. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1a5d5e5d |
|
01-Aug-2018 |
Jeremy Cline <jcline@redhat.com> |
ext4: fix spectre gadget in ext4_mb_regular_allocator() 'ac->ac_g_ex.fe_len' is a user-controlled value which is used in the derivation of 'ac->ac_2order'. 'ac->ac_2order', in turn, is used to index arrays which makes it a potential spectre gadget. Fix this by sanitizing the value assigned to 'ac->ac2_order'. This covers the following accesses found with the help of smatch: * fs/ext4/mballoc.c:1896 ext4_mb_simple_scan_group() warn: potential spectre issue 'grp->bb_counters' [w] (local cap) * fs/ext4/mballoc.c:445 mb_find_buddy() warn: potential spectre issue 'EXT4_SB(e4b->bd_sb)->s_mb_offsets' [r] (local cap) * fs/ext4/mballoc.c:446 mb_find_buddy() warn: potential spectre issue 'EXT4_SB(e4b->bd_sb)->s_mb_maxs' [r] (local cap) Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jeremy Cline <jcline@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
8844618d |
|
13-Jun-2018 |
Theodore Ts'o <tytso@mit.edu> |
ext4: only look at the bg_flags field if it is valid The bg_flags field in the block group descripts is only valid if the uninit_bg or metadata_csum feature is enabled. We were not consistently looking at this field; fix this. Also block group #0 must never have uninitialized allocation bitmaps, or need to be zeroed, since that's where the root inode, and other special inodes are set up. Check for these conditions and mark the file system as corrupted if they are detected. This addresses CVE-2018-10876. https://bugzilla.kernel.org/show_bug.cgi?id=199403 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
|
#
21c580d8 |
|
20-May-2018 |
Sean Fu <fxinrong@gmail.com> |
ext4: remove NULL check before calling kmem_cache_destroy() Signed-off-by: Sean Fu <fxinrong@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
247dbed8 |
|
11-Apr-2018 |
Christoph Hellwig <hch@lst.de> |
ext4: simplify procfs code Use remove_proc_subtree to remove the whole subtree on cleanup, and unwind the registration loop into individual calls. Switch to use proc_create_seq where applicable. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
736dedbb |
|
11-May-2018 |
Wang Shilong <wshilong@ddn.com> |
ext4: mark block bitmap corrupted when found There are still some cases that we missed to set block bitmaps corrupted bit properly: 1) block bitmap number is wrong. 2) failed to read block bitmap due to disk errors. 3) double free block bitmaps.. 4) some mismatch check with bitmaps vs buddy information. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Wang Shilong <wshilong@ddn.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
#
db79e6d1 |
|
12-May-2018 |
Wang Shilong <wshilong@ddn.com> |
ext4: add new ext4_mark_group_bitmap_corrupted() helper Since there are many places to set inode/block bitmap corrupt bit, add a new helper for it, which will make codes more clear. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
#
49598e04 |
|
11-Jan-2018 |
Jun Piao <piaojun@huawei.com> |
ext4: use 'sbi' instead of 'EXT4_SB(sb)' We could use 'sbi' instead of 'EXT4_SB(sb)' to make code more elegant. Signed-off-by: Jun Piao <piaojun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
f5166768 |
|
17-Dec-2017 |
Theodore Ts'o <tytso@mit.edu> |
ext4: fix up remaining files with SPDX cleanups A number of ext4 source files were skipped due because their copyright permission statements didn't match the expected text used by the automated conversion utilities. I've added SPDX tags for the rest. While looking at some of these files, I've noticed that we have quite a bit of variation on the licenses that were used --- in particular some of the Red Hat licenses on the jbd2 files use a GPL2+ license, and we have some files that have a LGPL-2.1 license (which was quite surprising). I've not attempted to do any license changes. Even if it is perfectly legal to relicense to GPL 2.0-only for consistency's sake, that should be done with ext4 developer community discussion. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
d77147ff |
|
29-Oct-2017 |
harshads <harshads@google.com> |
ext4: add support for online resizing with bigalloc This patch adds support for online resizing on bigalloc file system by implementing EXT4_IOC_RESIZE_FS ioctl. Old resize interfaces (add block groups and extend last block group) are left untouched. Tests performed with cluster sizes of 1, 2, 4 and 8 blocks (of size 4k) per cluster. I will add these tests to xfstests. Signed-off-by: Harshad Shirwadkar <harshads@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b80b32b6 |
|
14-Aug-2017 |
Theodore Ts'o <tytso@mit.edu> |
ext4: fix clang build regression Arnd Bergmann <arnd@arndb.de> As Stefan pointed out, I misremembered what clang can do specifically, and it turns out that the variable-length array at the end of the structure did not work (a flexible array would have worked here but not solved the problem): fs/ext4/mballoc.c:2303:17: error: fields must have a constant size: 'variable length array in structure' extension will never be supported ext4_grpblk_t counters[blocksize_bits + 2]; This reverts part of my previous patch, using a fixed-size array again, but keeping the check for the array overflow. Fixes: 2df2c3402fc8 ("ext4: fix warning about stack corruption") Reported-by: Stefan Agner <stefan@agner.ch> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
2df2c340 |
|
05-Aug-2017 |
Arnd Bergmann <arnd@arndb.de> |
ext4: fix warning about stack corruption After commit 62d1034f53e3 ("fortify: use WARN instead of BUG for now"), we get a warning about possible stack overflow from a memcpy that was not strictly bounded to the size of the local variable: inlined from 'ext4_mb_seq_groups_show' at fs/ext4/mballoc.c:2322:2: include/linux/string.h:309:9: error: '__builtin_memcpy': writing between 161 and 1116 bytes into a region of size 160 overflows the destination [-Werror=stringop-overflow=] We actually had a bug here that would have been found by the warning, but it was already fixed last year in commit 30a9d7afe70e ("ext4: fix stack memory corruption with 64k block size"). This replaces the fixed-length structure on the stack with a variable-length structure, using the correct upper bound that tells the compiler that everything is really fine here. I also change the loop count to check for the same upper bound for consistency, but the existing code is already correct here. Note that while clang won't allow certain kinds of variable-length arrays in structures, this particular instance is fine, as the array is at the end of the structure, and the size is strictly bounded. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
e4510577 |
|
05-Aug-2017 |
Daeho Jeong <daeho.jeong@samsung.com> |
ext4: release discard bio after sending discard commands We've changed the discard command handling into parallel manner. But, in this change, I forgot decreasing the usage count of the bio which was used to send discard request. I'm sorry about that. Fixes: a015434480dc ("ext4: send parallel discards on commit completions") Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
ff950156 |
|
06-Jul-2017 |
Colin Ian King <colin.king@canonical.com> |
ext4: fix spelling mistake: "prellocated" -> "preallocated" Trivial fix to spelling mistake in mb_debug debug message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a0154344 |
|
22-Jun-2017 |
Daeho Jeong <daeho.jeong@samsung.com> |
ext4: send parallel discards on commit completions Now, when we mount ext4 filesystem with '-o discard' option, we have to issue all the discard commands for the blocks to be deallocated and wait for the completion of the commands on the commit complete phase. Because this procedure might involve a lot of sequential combinations of issuing discard commands and waiting for that, the delay of this procedure might be too much long, even to 17.0s in our test, and it results in long commit delay and fsync() performance degradation. To reduce this kind of delay, instead of adding callback for each extent and handling all of them in a sequential manner on commit phase, we instead add a separate list of extents to free to the superblock and then process this list at once after transaction commits so that we can issue all the discard commands in a parallel manner like XFS filesystem. Finally, we could enhance the discard command handling performance. The result was such that 17.0s delay of a single commit in the worst case has been enhanced to 4.8s. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Hobin Woo <hobin.woo@samsung.com> Tested-by: Kitae Lee <kitae87.lee@samsung.com> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
02749a4c |
|
22-Jun-2017 |
Tahsin Erdogan <tahsin@google.com> |
ext4: add ext4_is_quota_file() IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4 also uses it to check whether an inode is for a quota file. The distinction currently doesn't matter because quota is disabled only for the quota files. When we start disabling quota for other inodes in the future, we will want to make the distinction clear. Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where we are checking for quota files. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
9651e6b2 |
|
21-May-2017 |
Konstantin Khlebnikov <koct9i@gmail.com> |
ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors I've got another report about breaking ext4 by ENOMEM error returned from ext4_mb_load_buddy() caused by memory shortage in memory cgroup. This time inside ext4_discard_preallocations(). This patch replaces ext4_error() with ext4_warning() where errors returned from ext4_mb_load_buddy() are not fatal and handled by caller: * ext4_mb_discard_group_preallocations() - called before generating ENOSPC, we'll try to discard other group or return ENOSPC into user-space. * ext4_trim_all_free() - just stop trimming and return ENOMEM from ioctl. Some callers cannot handle errors, thus __GFP_NOFAIL is used for them: * ext4_discard_preallocations() * ext4_mb_discard_lg_preallocations() Fixes: adb7ef600cc9 ("ext4: use __GFP_NOFAIL in ext4_free_blocks()") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
a7c3e901 |
|
08-May-2017 |
Michal Hocko <mhocko@suse.com> |
mm: introduce kv[mz]alloc helpers Patch series "kvmalloc", v5. There are many open coded kmalloc with vmalloc fallback instances in the tree. Most of them are not careful enough or simply do not care about the underlying semantic of the kmalloc/page allocator which means that a) some vmalloc fallbacks are basically unreachable because the kmalloc part will keep retrying until it succeeds b) the page allocator can invoke a really disruptive steps like the OOM killer to move forward which doesn't sound appropriate when we consider that the vmalloc fallback is available. As it can be seen implementing kvmalloc requires quite an intimate knowledge if the page allocator and the memory reclaim internals which strongly suggests that a helper should be implemented in the memory subsystem proper. Most callers, I could find, have been converted to use the helper instead. This is patch 6. There are some more relying on __GFP_REPEAT in the networking stack which I have converted as well and Eric Dumazet was not opposed [2] to convert them as well. [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com This patch (of 9): Using kmalloc with the vmalloc fallback for larger allocations is a common pattern in the kernel code. Yet we do not have any common helper for that and so users have invented their own helpers. Some of them are really creative when doing so. Let's just add kv[mz]alloc and make sure it is implemented properly. This implementation makes sure to not make a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also to not warn about allocation failures. This also rules out the OOM killer as the vmalloc is a more approapriate fallback than a disruptive user visible action. This patch also changes some existing users and removes helpers which are specific for them. In some cases this is not possible (e.g. ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and require GFP_NO{FS,IO} context which is not vmalloc compatible in general (note that the page table allocation is GFP_KERNEL). Those need to be fixed separately. While we are at it, document that __vmalloc{_node} about unsupported gfp mask because there seems to be a lot of confusion out there. kvmalloc_node will warn about GFP_KERNEL incompatible (which are not superset) flags to catch new abusers. Existing ones would have to die slowly. [sfr@canb.auug.org.au: f2fs fixup] Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part] Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0c9ec4be |
|
29-Apr-2017 |
Darrick J. Wong <darrick.wong@oracle.com> |
ext4: support GETFSMAP ioctls Support the GETFSMAP ioctls so that we can use the xfs free space management tools to probe ext4 as well. Note that this is a partial implementation -- we only report fixed-location metadata and free space; everything else is reported as "unknown". Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
d6006186 |
|
29-Apr-2017 |
Eric Biggers <ebiggers@google.com> |
ext4: constify static data that is never modified Constify static data in ext4 that is never (intentionally) modified so that it is placed in .rodata and benefits from memory protection. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
93407472 |
|
27-Feb-2017 |
Fabian Frederick <fabf@skynet.be> |
fs: add i_blocksize() Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs branch. This patch also fixes multiple checkpatch warnings: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' Thanks to Andrew Morton for suggesting more appropriate function instead of macro. [geliangtang@gmail.com: truncate: use i_blocksize()] Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d9b22cf9 |
|
09-Feb-2017 |
Jan Kara <jack@suse.cz> |
ext4: fix stripe-unaligned allocations When a filesystem is created using: mkfs.ext4 -b 4096 -E stride=512 <dev> and we try to allocate 64MB extent, we will end up directly in ext4_mb_complex_scan_group(). This is because the request is detected as power-of-two allocation (so we start in ext4_mb_regular_allocator() with ac_criteria == 0) however the check before ext4_mb_simple_scan_group() refuses the direct buddy scan because the allocation request is too large. Since cr == 0, the check whether we should use ext4_mb_scan_aligned() fails as well and we fall back to ext4_mb_complex_scan_group(). Fix the problem by checking for upper limit on power-of-two requests directly when detecting them. Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
cd648b8a |
|
27-Jan-2017 |
Jan Kara <jack@suse.cz> |
ext4: trim allocation requests to group size If filesystem groups are artifically small (using parameter -g to mkfs.ext4), ext4_mb_normalize_request() can result in a request that is larger than a block group. Trim the request size to not confuse allocation code. Reported-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
43c73221 |
|
22-Jan-2017 |
Theodore Ts'o <tytso@mit.edu> |
ext4: replace BUG_ON with WARN_ON in mb_find_extent() The last BUG_ON in mb_find_extent() is apparently triggering in some rare cases. Most of the time it indicates a bug in the buddy bitmap algorithms, but there are some weird cases where it can trigger when buddy bitmap is still in memory, but the block bitmap has to be read from disk, and there is disk or memory corruption such that the block bitmap and the buddy bitmap are out of sync. Google-Bug-Id: #33702157 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
30a9d7af |
|
14-Nov-2016 |
Chandan Rajendra <chandan@linux.vnet.ibm.com> |
ext4: fix stack memory corruption with 64k block size The number of 'counters' elements needed in 'struct sg' is super_block->s_blocksize_bits + 2. Presently we have 16 'counters' elements in the array. This is insufficient for block sizes >= 32k. In such cases the memcpy operation performed in ext4_mb_seq_groups_show() would cause stack memory corruption. Fixes: c9de560ded61f Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
|
#
69e43e8c |
|
14-Nov-2016 |
Chandan Rajendra <chandan@linux.vnet.ibm.com> |
ext4: fix mballoc breakage with 64k block size 'border' variable is set to a value of 2 times the block size of the underlying filesystem. With 64k block size, the resulting value won't fit into a 16-bit variable. Hence this commit changes the data type of 'border' to 'unsigned int'. Fixes: c9de560ded61f Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@vger.kernel.org
|
#
554a5ccc |
|
14-Jul-2016 |
Vegard Nossum <vegard.nossum@oracle.com> |
ext4: fix reference counting bug on block allocation error If we hit this error when mounted with errors=continue or errors=remount-ro: EXT4-fs error (device loop0): ext4_mb_mark_diskspace_used:2940: comm ext4.exe: Allocating blocks 5090-6081 which overlap fs metadata then ext4_mb_new_blocks() will call ext4_mb_release_context() and try to continue. However, ext4_mb_release_context() is the wrong thing to call here since we are still actually using the allocation context. Instead, just error out. We could retry the allocation, but there is a possibility of getting stuck in an infinite loop instead, so this seems safer. [ Fixed up so we don't return EAGAIN to userspace. --tytso ] Fixes: 8556e8f3b6 ("ext4: Don't allow new groups to be added during block allocation") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: stable@vger.kernel.org
|
#
d08854f5 |
|
26-Jun-2016 |
Theodore Ts'o <tytso@mit.edu> |
ext4: optimize ext4_should_retry_alloc() to improve ENOSPC performance If there are no pending blocks to be released after a commit, forcing a journal commit has no hope of helping. It's possible that a commit had just completed, so if there are now free blocks available for allocation, it's worth retrying the commit. Reported-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
84c60b13 |
|
27-May-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
drop redundant ->owner initializations it's not needed for file_operations of inodes located on fs defined in the hosting module and for file_operations that go into procfs. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
935244cd |
|
05-May-2016 |
Nicolai Stange <nicstange@gmail.com> |
ext4: silence UBSAN in ext4_mb_init() Currently, in ext4_mb_init(), there's a loop like the following: do { ... offset += 1 << (sb->s_blocksize_bits - i); i++; } while (i <= sb->s_blocksize_bits + 1); Note that the updated offset is used in the loop's next iteration only. However, at the last iteration, that is at i == sb->s_blocksize_bits + 1, the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3)) and UBSAN reports UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15 shift exponent 4294967295 is too large for 32-bit type 'int' [...] Call Trace: [<ffffffff818c4d25>] dump_stack+0xbc/0x117 [<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169 [<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e [<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254 [<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158 [<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390 [<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0 [<ffffffff814293c7>] ? create_cache+0x57/0x1f0 [<ffffffff8142948a>] ? create_cache+0x11a/0x1f0 [<ffffffff821c2168>] ? mutex_lock+0x38/0x60 [<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50 [<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0 [<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0 [<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0 [...] Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1. Unless compilers start to do some fancy transformations (which at least GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the such calculated value of offset is never used again. Silence UBSAN by introducing another variable, offset_incr, holding the next increment to apply to offset and adjust that one by right shifting it by one position per loop iteration. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161 Cc: stable@vger.kernel.org Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b5cb316c |
|
05-May-2016 |
Nicolai Stange <nicstange@gmail.com> |
ext4: address UBSAN warning in mb_find_order_for_block() Currently, in mb_find_order_for_block(), there's a loop like the following: while (order <= e4b->bd_blkbits + 1) { ... bb += 1 << (e4b->bd_blkbits - order); } Note that the updated bb is used in the loop's next iteration only. However, at the last iteration, that is at order == e4b->bd_blkbits + 1, the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11 shift exponent -1 is negative [...] Call Trace: [<ffffffff818c4d35>] dump_stack+0xbc/0x117 [<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169 [<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e [<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254 [<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158 [<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590 [<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80 [<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240 [...] Unless compilers start to do some fancy transformations (which at least GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the such calculated value of bb is never used again. Silence UBSAN by introducing another variable, bb_incr, holding the next increment to apply to bb and adjust that one by right shifting it by one position per loop iteration. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161 Cc: stable@vger.kernel.org Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
8d2ae1cb |
|
26-Apr-2016 |
Jakub Wilk <jwilk@jwilk.net> |
ext4: remove trailing \n from ext4_warning/ext4_error calls Messages passed to ext4_warning() or ext4_error() don't need trailing newlines, because these function add the newlines themselves. Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
|
#
ea1754a0 |
|
01-Apr-2016 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
09cbfeaf |
|
01-Apr-2016 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
adb7ef60 |
|
13-Mar-2016 |
Konstantin Khlebnikov <koct9i@gmail.com> |
ext4: use __GFP_NOFAIL in ext4_free_blocks() This might be unexpected but pages allocated for sbi->s_buddy_cache are charged to current memory cgroup. So, GFP_NOFS allocation could fail if current task has been killed by OOM or if current memory cgroup has no free memory left. Block allocator cannot handle such failures here yet. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b8a07463 |
|
09-Mar-2016 |
Adam Buchbinder <adam.buchbinder@gmail.com> |
ext4: fix misspellings in comments. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
f96c450d |
|
21-Feb-2016 |
Daeho Jeong <daeho.jeong@samsung.com> |
ext4: make sure to revoke all the freeable blocks in ext4_free_blocks Now, ext4_free_blocks() doesn't revoke data blocks of per-file data journalled inode and it can cause file data inconsistency problems. Even though data blocks of per-file data journalled inode are already forgotten by jbd2_journal_invalidatepage() in advance of invoking ext4_free_blocks(), we still need to revoke the data blocks here. Moreover some of the metadata blocks, which are not found by sb_find_get_block(), are still needed to be revoked, but this is also missing here. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
802cf1f9 |
|
11-Feb-2016 |
Huaitong Han <huaitong.han@intel.com> |
ext4: add a line break for proc mb_groups display This patch adds a line break for proc mb_groups display. Signed-off-by: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
#
79211c8e |
|
09-Nov-2015 |
Andrew Morton <akpm@linux-foundation.org> |
remove abs64() Switch everything to the new and more capable implementation of abs(). Mainly to give the new abs() a bit of a workout. Cc: Michal Nazarewicz <mina86@mina86.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
16175039 |
|
18-Oct-2015 |
John Stultz <john.stultz@linaro.org> |
ext4: fix abs() usage in ext4_mb_check_group_pa The ext4_fsblk_t type is a long long, which should not be used with abs(), as is done in ext4_mb_check_group_pa(). This patch modifies ext4_mb_check_group_pa() to use abs64() instead. Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
9c02ac97 |
|
17-Oct-2015 |
Daeho Jeong <daeho.jeong@samsung.com> |
ext4: fix xfstest generic/269 double revoked buffer bug with bigalloc When you repeatly execute xfstest generic/269 with bigalloc_1k option enabled using the below command: "./kvm-xfstests -c bigalloc_1k -m nodelalloc -C 1000 generic/269" you can easily see the below bug message. "JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);" This means that an already revoked buffer is erroneously revoked again and it is caused by doing revoke for the buffer at the wrong position in ext4_free_blocks(). We need to re-position the buffer revoke procedure for an unspecified buffer after checking the cluster boundary for bigalloc option. If not, some part of the cluster can be doubly revoked. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
|
#
9008a58e |
|
17-Oct-2015 |
Darrick J. Wong <darrick.wong@oracle.com> |
ext4: make the bitmap read routines return real error codes Make the bitmap reaading routines return real error codes (EIO, EFSCORRUPTED, EFSBADCRC) which can then be reflected back to userspace for more precise diagnosis work. In particular, this means that mballoc no longer claims that we're out of memory if the block bitmaps become corrupt. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
ebd173be |
|
22-Sep-2015 |
Theodore Ts'o <tytso@mit.edu> |
ext4: move procfs registration code to fs/ext4/sysfs.c This allows us to refactor the procfs code, which saves a bit of compiled space. More importantly it isolates most of the procfs support code into a single file, so it's easier to #ifdef it out if the proc file system has been disabled. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
7444a072 |
|
04-Jul-2015 |
Michal Hocko <mhocko@suse.cz> |
ext4: replace open coded nofail allocation in ext4_free_blocks() ext4_free_blocks is looping around the allocation request and mimics __GFP_NOFAIL behavior without any allocation fallback strategy. Let's remove the open coded loop and replace it with __GFP_NOFAIL. Without the flag the allocator has no way to find out never-fail requirement and cannot help in any way. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
97b4af2f |
|
14-Jun-2015 |
Rasmus Villemoes <linux@rasmusvillemoes.dk> |
ext4: mballoc: avoid 20-argument function call Making a function call with 20 arguments is rather expensive in both stack and .text. In this case, doing the formatting manually doesn't make it any less readable, so we might as well save 155 bytes of .text and 112 bytes of stack. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
|
#
42ac1848 |
|
08-Jun-2015 |
Lukas Czerner <lczerner@redhat.com> |
ext4: return error code from ext4_mb_good_group() Currently ext4_mb_good_group() only returns 0 or 1 depending on whether the allocation group is suitable for use or not. However we might get various errors and fail while initializing new group including -EIO which would never get propagated up the call chain. This might lead to an endless loop at writeback when we're trying to find a good group to allocate from and we fail to initialize new group (read error for example). Fix this by returning proper error code from ext4_mb_good_group() and using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator() we will always return only the first occurred error from ext4_mb_good_group() and we only propagate it back to the caller if we do not get any other errors and we fail to allocate any blocks. Note that with other modes than errors=continue, we will fail immediately in ext4_mb_good_group() in case of error, however with errors=continue we should try to continue using the file system, that's why we're not going to fail immediately when we see an error from ext4_mb_good_group(), but rather when we fail to find a suitable block group to allocate from due to an problem in group initialization. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
bbdc322f |
|
08-Jun-2015 |
Lukas Czerner <lczerner@redhat.com> |
ext4: try to initialize all groups we can in case of failure on ppc64 Currently on the machines with page size > block size when initializing block group buddy cache we initialize it for all the block group bitmaps in the page. However in the case of read error, checksum error, or if a single bitmap is in any way corrupted we would fail to initialize all of the bitmaps. This is problematic because we will not have access to the other allocation groups even though those might be perfectly fine and usable. Fix this by reading all the bitmaps instead of error out on the first problem and simply skip the bitmaps which were either not read properly, or are not valid. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
66114cad |
|
22-May-2015 |
Tejun Heo <tj@kernel.org> |
writeback: separate out include/linux/backing-dev-defs.h With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. v2: fs/fat build failure fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
bfcba2d0 |
|
25-Nov-2014 |
Markus Elfring <elfring@users.sourceforge.net> |
ext4: Remove an unnecessary check for NULL before iput() The iput() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
4fdb5543 |
|
25-Nov-2014 |
Dmitry Monakhov <dmonakhov@openvz.org> |
ext4: cleanup GFP flags inside resize path We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo and ext4_calculate_overhead() because they are called from inside a journal transaction. Call trace: ioctl ->ext4_group_add ->journal_start ->ext4_setup_new_descs ->ext4_mb_add_groupinfo -> GFP_KERNEL ->ext4_flex_group_add ->ext4_update_super ->ext4_calculate_overhead -> GFP_KERNEL ->journal_stop Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b93b41d4 |
|
19-Nov-2014 |
Al Viro <viro@ZenIV.linux.org.uk> |
ext4: kill ext4_kvfree() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
dfe076c1 |
|
01-Oct-2014 |
Dmitry Monakhov <dmonakhov@openvz.org> |
ext4: get rid of code duplication Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
754cfed6 |
|
04-Sep-2014 |
Theodore Ts'o <tytso@mit.edu> |
ext4: drop the EXT4_STATE_DELALLOC_RESERVED flag Having done a full regression test, we can now drop the DELALLOC_RESERVED state flag. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
e3cf5d5d |
|
04-Sep-2014 |
Theodore Ts'o <tytso@mit.edu> |
ext4: prepare to drop EXT4_STATE_DELALLOC_RESERVED The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented because it was too hard to make sure the mballoc and get_block flags could be reliably passed down through all of the codepaths that end up calling ext4_mb_new_blocks(). Since then, we have mb_flags passed down through most of the code paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky as it used to. This commit plumbs in the last of what is required, and then adds a WARN_ON check to make sure we haven't missed anything. If this passes a full regression test run, we can then drop EXT4_STATE_DELALLOC_RESERVED. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
a0b6bc63 |
|
16-Aug-2014 |
Christoph Lameter <cl@linux.com> |
block: Replace __this_cpu_ptr with raw_cpu_ptr __this_cpu_ptr is being phased out use raw_cpu_ptr instead which was introduced in 3.15-rc1. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
|
#
c99d1e6e |
|
23-Aug-2014 |
Theodore Ts'o <tytso@mit.edu> |
ext4: fix BUG_ON in mb_free_blocks() If we suffer a block allocation failure (for example due to a memory allocation failure), it's possible that we will call ext4_discard_allocated_blocks() before we've actually allocated any blocks. In that case, fe_len and fe_start in ac->ac_f_ex will still be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0) triggering the BUG_ON on mb_free_blocks(): BUG_ON(last >= (sb->s_blocksize << 3)); Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len is zero. Also fix a missing ext4_mb_unload_buddy() call in ext4_discard_allocated_blocks(). Google-Bug-Id: 16844242 Fixes: 86f0afd463215fc3e58020493482faa4ac3a4d69 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
86f0afd4 |
|
30-Jul-2014 |
Theodore Ts'o <tytso@mit.edu> |
ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct If there is a failure while allocating the preallocation structure, a number of blocks can end up getting marked in the in-memory buddy bitmap, and then not getting released. This can result in the following corruption getting reported by the kernel: EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126, 12793 clusters in bitmap, 12729 in gd In that case, we need to release the blocks using mb_free_blocks(). Tested: fs smoke test; also demonstrated that with injected errors, the file system is no longer getting corrupted Google-Bug-Id: 16657874 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
b27b1535 |
|
27-Jul-2014 |
Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> |
ext4: fix wrong size computation in ext4_mb_normalize_request() As the member fe_len defined in struct ext4_free_extent is expressed as number of clusters, the variable "size" computation is wrong, we need to first translate fe_len to block number, then to bytes. Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
|
#
71d4f7d0 |
|
15-Jul-2014 |
Theodore Ts'o <tytso@mit.edu> |
ext4: remove metadata reservation checks Commit 27dd43854227b ("ext4: introduce reserved space") reserves 2% of the file system space to make sure metadata allocations will always succeed. Given that, tracking the reservation of metadata blocks is no longer necessary. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
94d4c066 |
|
05-Jul-2014 |
Theodore Ts'o <tytso@mit.edu> |
ext4: clarify ext4_error message in ext4_mb_generate_buddy_error() We are spending a lot of time explaining to users what this error means. Let's try to improve the message to avoid this problem. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
e43bb4e6 |
|
26-Jun-2014 |
Namjae Jeon <namjae.jeon@samsung.com> |
ext4: decrement free clusters/inodes counters when block group declared bad We should decrement free clusters counter when block bitmap is marked as corrupt and free inodes counter when the allocation bitmap is marked as corrupt to avoid misunderstanding due to incorrect available size in statfs result. User can get immediately ENOSPC error from write begin without reaching for the writepages. Cc: Darrick J. Wong<darrick.wong@oracle.com> Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
|
#
2457aec6 |
|
04-Jun-2014 |
Mel Gorman <mgorman@suse.de> |
mm: non-atomically mark page accessed during page cache allocation where possible aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b5b60778 |
|
26-May-2014 |
Maurizio Lombardi <mlombard@redhat.com> |
ext4: fix wrong assert in ext4_mb_normalize_request() The variable "size" is expressed as number of blocks and not as number of clusters, this could trigger a kernel panic when using ext4 with the size of a cluster different from the size of a block. Cc: stable@vger.kernel.org Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
5d601255 |
|
12-May-2014 |
liang xie <xieliang007@gmail.com> |
ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access Make them more consistently Signed-off-by: xieliang <xieliang@xiaomi.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
029b10c5 |
|
11-May-2014 |
Andrey Tsyvarev <tsyvarev@ispras.ru> |
ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails Caches from 'ext4_groupinfo_caches' may be in use by other mounts, which have already existed. So, it is incorrect to destroy them when newly requested mount fails. Found by Linux File System Verification project (linuxtesting.org). Signed-off-by: Andrey Tsyvarev <tsyvarev@ispras.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
|
#
e2cbd587 |
|
12-Apr-2014 |
jon ernst <jonernst07@gmail.com> |
ext4: silence sparse check warning for function ext4_trim_extent This fixes the following sparse warning: CHECK fs/ext4/mballoc.c fs/ext4/mballoc.c:5019:9: warning: context imbalance in 'ext4_trim_extent' - unexpected unlock Signed-off-by: "Jon Ernst" <jonernst07@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
c57ab39b |
|
10-Apr-2014 |
Younger Liu <younger.liucn@gmail.com> |
ext4: return ENOMEM rather than EIO when find_###_page() fails Return ENOMEM rather than EIO when find_get_page() fails in ext4_mb_get_buddy_page_lock() and find_or_create_page() fails in ext4_mb_load_buddy(). Signed-off-by: Younger Liu <younger.liucn@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
dc9ddd98 |
|
20-Feb-2014 |
Eric Sandeen <sandeen@redhat.com> |
ext4: remove unused ac_ex_scanned When looking at a bug report with: > kernel: EXT4-fs: 0 scanned, 0 found I thought wow, 0 scanned, that's odd? But it's not odd; it's printing a variable that is initialized to 0 and never touched again. It's never been used since the original merge, so I don't really even know what the original intent was, either. If anyone knows how to hook it up, speak now via patch, otherwise just yank it so it's not making a confusing situation more confusing in kernel logs. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
ab0c00fc |
|
19-Feb-2014 |
Theodore Ts'o <tytso@mit.edu> |
ext4: make sure ex.fe_logical is initialized The lowest levels of mballoc set all of the fields of struct ext4_free_extent except for fe_logical, since they are just trying to find the requested free set of blocks, and the logical block hasn't been set yet. This makes some static code checkers sad. Set it to various different debug values, which would be useful when debugging mballoc if these values were to ever show up due to the parts of mballoc triyng to use ac->ac_b_ex.fe_logical before it is properly upper layers of mballoc failing to properly set, usually by ext4_mb_use_best_found(). Addresses-Coverity-Id: #139697 Addresses-Coverity-Id: #139698 Addresses-Coverity-Id: #139699 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
f5a44db5 |
|
20-Dec-2013 |
Theodore Ts'o <tytso@mit.edu> |
ext4: add explicit casts when masking cluster sizes The missing casts can cause the high 64-bits of the physical blocks to be lost. Set up new macros which allows us to make sure the right thing happen, even if at some point we end up supporting larger logical block numbers. Thanks to the Emese Revfy and the PaX security team for reporting this issue. Reported-by: PaX Team <pageexec@freemail.hu> Reported-by: Emese Revfy <re.emese@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
4e8d2139 |
|
03-Dec-2013 |
Junho Ryu <jayr@google.com> |
ext4: fix use-after-free in ext4_mb_new_blocks ext4_mb_put_pa should hold pa->pa_lock before accessing pa->pa_count. While ext4_mb_use_preallocated checks pa->pa_deleted first and then increments pa->count later, ext4_mb_put_pa decrements pa->pa_count before holding pa->pa_lock and then sets pa->pa_deleted. * Free sequence ext4_mb_put_pa (1): atomic_dec_and_test pa->pa_count ext4_mb_put_pa (2): lock pa->pa_lock ext4_mb_put_pa (3): check pa->pa_deleted ext4_mb_put_pa (4): set pa->pa_deleted=1 ext4_mb_put_pa (5): unlock pa->pa_lock ext4_mb_put_pa (6): remove pa from a list ext4_mb_pa_callback: free pa * Use sequence ext4_mb_use_preallocated (1): iterate over preallocation ext4_mb_use_preallocated (2): lock pa->pa_lock ext4_mb_use_preallocated (3): check pa->pa_deleted ext4_mb_use_preallocated (4): increase pa->pa_count ext4_mb_use_preallocated (5): unlock pa->pa_lock ext4_mb_release_context: access pa * Use-after-free sequence [initial status] <pa->pa_deleted = 0, pa_count = 1> ext4_mb_use_preallocated (1): iterate over preallocation ext4_mb_use_preallocated (2): lock pa->pa_lock ext4_mb_use_preallocated (3): check pa->pa_deleted ext4_mb_put_pa (1): atomic_dec_and_test pa->pa_count [pa_count decremented] <pa->pa_deleted = 0, pa_count = 0> ext4_mb_use_preallocated (4): increase pa->pa_count [pa_count incremented] <pa->pa_deleted = 0, pa_count = 1> ext4_mb_use_preallocated (5): unlock pa->pa_lock ext4_mb_put_pa (2): lock pa->pa_lock ext4_mb_put_pa (3): check pa->pa_deleted ext4_mb_put_pa (4): set pa->pa_deleted=1 [race condition!] <pa->pa_deleted = 1, pa_count = 1> ext4_mb_put_pa (5): unlock pa->pa_lock ext4_mb_put_pa (6): remove pa from a list ext4_mb_pa_callback: free pa ext4_mb_release_context: access pa AddressSanitizer has detected use-after-free in ext4_mb_new_blocks Bug report: http://goo.gl/rG1On3 Signed-off-by: Junho Ryu <jayr@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
8f9ff189 |
|
30-Oct-2013 |
Lukas Czerner <lczerner@redhat.com> |
ext4: fix FITRIM in no journal mode When using FITRIM ioctl on a file system without journal it will only trim the block group once, no matter how many times you invoke FITRIM ioctl and how many block you release from the block group. It is because we only clear EXT4_GROUP_INFO_WAS_TRIMMED_BIT in journal callback. Fix this by clearing the bit in no journal mode as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Jorge Fábregas <jorge.fabregas@gmail.com>
|
#
163a203d |
|
28-Aug-2013 |
Darrick J. Wong <darrick.wong@oracle.com> |
ext4: mark block group as corrupt on block bitmap error When we notice a block-bitmap corruption (because of device failure or something else), we should mark this group as corrupt and prevent further block allocations/deallocations from it. Currently, we end up generating one error message for every block in the bitmap. This potentially could make the system unstable as noticed in some bugs. With this patch, the error will be printed only the first time and mark the entire block group as corrupted. This prevents future access allocations/deallocations from it. Also tested by corrupting the block bitmap and forcefully introducing the mb_free_blocks error: (1) create a largefile (2Gb) $ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200 (2) umount filesystem. use dumpe2fs to see which block-bitmaps are in use by largefile and note their block numbers (3) use dd to zero-out the used block bitmaps $ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct (4) mount the FS and delete the largefile. (5) recreate the largefile. verify that the new largefile does not get any blocks from the groups marked as bad. Without the patch, we will see mb_free_blocks error for each bit in each zero'ed out bitmap at (4). With the patch, we only see the error once per blockgroup: [ 309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure [ 309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure [ 309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure [ 309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. [ 309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure [ 309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted. Google-Bug-Id: 7258357 [darrick.wong@oracle.com] Further modifications (by Darrick) to make more obvious that this corruption bit applies to blocks only. Set the corruption flag if the block group bitmap verification fails. Original-author: Aditya Kali <adityakali@google.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7d734532 |
|
17-Aug-2013 |
Jan Kara <jack@suse.cz> |
ext4: fix warning in ext4_da_update_reserve_space() reaim workfile.dbase test easily triggers warning in ext4_da_update_reserve_space(): EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365: ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks with reserved 9 data blocks) The problem is that (one of) tests creates file and then randomly writes to it with O_SYNC. That results in writing back pages of the file in random order so we create extents for written blocks say 0, 2, 4, 6, 8 - this last allocation also allocates new block for extents. Then we writeout block 1 so we have extents 0-2, 4, 6, 8 and we release indirect extent block because extents fit in the inode again. Then we writeout block 10 and we need to allocate indirect extent block again which triggers the warning because we don't have the reservation anymore. Fix the problem by giving back freed metadata blocks resulting from extent merging into inode's reservation pool. Signed-off-by: Jan Kara <jack@suse.cz>
|
#
e7676a70 |
|
12-Jul-2013 |
Theodore Ts'o <tytso@mit.edu> |
ext4: don't allow ext4_free_blocks() to fail due to ENOMEM The filesystem should not be marked inconsistent if ext4_free_blocks() is not able to allocate memory. Unfortunately some callers (most notably ext4_truncate) don't have a way to reflect an error back up to the VFS. And even if we did, most userspace applications won't deal with most system calls returning ENOMEM anyway. Reported-by: Nagachandra P <nagachandra@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
2c00ef3e |
|
01-Jul-2013 |
Alexey Khoroshilov <khoroshilov@ispras.ru> |
ext4: implement error handling of ext4_mb_new_preallocation() If memory allocation in ext4_mb_new_group_pa() is failed, it returns error code, ext4_mb_new_preallocation() propages it, but ext4_mb_new_blocks() ignores it. An observed result was: - allocation fail means ext4_mb_new_group_pa() does not update ext4_allocation_context; - ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len = ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead of number of blocks requested (1); - that activates update cycle in ext4_splice_branch(): for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here *(where->p + i) = cpu_to_le32(current_block++); - it iterates 511 times and corrupts a chunk of memory including inode structure; - page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty(); - system hangs with 'scheduling while atomic' BUG. The patch implements a check for ext4_mb_new_preallocation() error code and handles its failure as if ext4_mb_regular_allocator() fails. Found by Linux File System Verification project (linuxtesting.org). [ Patch restructed by tytso to make the flow of control easier to follow. ] Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
2ed5724d |
|
12-Jun-2013 |
Theodore Ts'o <tytso@mit.edu> |
ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator() For a file systems with a very large number of block groups, if all of the block group bitmaps are in memory and the file system is relatively badly fragmented, it's possible ext4_mb_regular_allocator() to take a long time trying to find a good match. This is especially true if the tuning parameter mb_max_to_scan has been sent to a very large number. So add a cond_resched() to avoid soft lockup warnings and to provide better system responsiveness. For ext4_free_blocks(), if we are deleting a large range of blocks, and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed, the loop to call sb_find_get_block() and to call ext4_forget() can take over 10-15 milliseocnds or more. So it's better to add a cond_resched() here a well. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e6155736 |
|
05-May-2013 |
Lachlan McIlroy <lmcilroy@redhat.com> |
ext4: limit group search loop for non-extent files In the case where we are allocating for a non-extent file, we must limit the groups we allocate from to those below 2^32 blocks, and ext4_mb_regular_allocator() attempts to do this initially by putting a cap on ngroups for the subsequent search loop. However, the initial target group comes in from the allocation context (ac), and it may already be beyond the artificially limited ngroups. In this case, the limit if (group == ngroups) group = 0; at the top of the loop is never true, and the loop will run away. Catch this case inside the loop and reset the search to start at group 0. [sandeen@redhat.com: add commit msg & comments] Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
8c8e0ca6 |
|
09-Apr-2013 |
Dmitri Monakho <dmonakhov@openvz.org> |
ext4: fix usless declarations This patch should fix sparse complains about shadow declatations. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
d9dda78b |
|
31-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
procfs: new helper - PDE_DATA(inode) The only part of proc_dir_entry the code outside of fs/proc really cares about is PDE(inode)->data. Provide a helper for that; static inline for now, eventually will be moved to fs/proc, along with the knowledge of struct proc_dir_entry layout. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
eabe0444 |
|
08-Apr-2013 |
Andrey Sidorov <qrxd43@motorola.com> |
ext4: speed-up releasing blocks on commit Improve mb_free_blocks speed by clearing entire range at once instead of iterating over each bit. Freeing block-by-block also makes buddy bitmap subtree flip twice making most of the work a no-op. Very few bits in buddy bitmap require change, e.g. freeing entire group is a 1 bit flip only. As a result, releasing blocks of 60G file now takes 5ms instead of 2.7s. This is especially good for non-preemptive kernels as there is no rescheduling during release. Signed-off-by: Andrey Sidorov <qrxd43@motorola.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
bd86298e |
|
03-Apr-2013 |
Lukas Czerner <lczerner@redhat.com> |
ext4: introduce ext4_get_group_number() Currently on many places in ext4 we're using ext4_get_group_no_and_offset() even though we're only interested in knowing the block group of the particular block, not the offset within the block group so we can use more efficient way to compute block group. This patch introduces ext4_get_group_number() which computes block group for a given block much more efficiently. Use this function instead of ext4_get_group_no_and_offset() everywhere where we're only interested in knowing the block group. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5d3ee208 |
|
03-Apr-2013 |
Dmitry Monakhov <dmonakhov@openvz.org> |
ext4: fix journal callback list traversal It is incorrect to use list_for_each_entry_safe() for journal callback traversial because ->next may be removed by other task: ->ext4_mb_free_metadata() ->ext4_mb_free_metadata() ->ext4_journal_callback_del() This results in the following issue: WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250() Hardware name: list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107 Call Trace: [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80 [<ffffffff810ac6be>] kthread+0x10e/0x120 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 This patch fix the issue as follows: - ext4_journal_commit_callback() make list truly traversial safe simply by always starting from list_head - fix race between two ext4_journal_callback_del() and ext4_journal_callback_try_del() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.com
|
#
b10a44c3 |
|
03-Apr-2013 |
Theodore Ts'o <tytso@mit.edu> |
ext4: add might_sleep() annotations Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
|
#
90ba983f |
|
11-Mar-2013 |
Theodore Ts'o <tytso@mit.edu> |
ext4: use atomic64_t for the per-flexbg free_clusters count A user who was using a 8TB+ file system and with a very large flexbg size (> 65536) could cause the atomic_t used in the struct flex_groups to overflow. This was detected by PaX security patchset: http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551 This bug was introduced in commit 9f24e4208f7e, so it's been around since 2.6.30. :-( Fix this by using an atomic64_t for struct orlav_stats's free_clusters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org
|
#
bb8b20ed |
|
10-Mar-2013 |
Lukas Czerner <lczerner@redhat.com> |
ext4: do not use yield() Using yield() is strongly discouraged (see sched/core.c) especially since we can just use cond_resched(). Replace all use of yield() with cond_resched(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e3d85c36 |
|
10-Mar-2013 |
Lukas Czerner <lczerner@redhat.com> |
ext4: remove unused variable in ext4_free_blocks() Remove unused variable 'freed' in ext4_free_blocks(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
810da240 |
|
02-Mar-2013 |
Lukas Czerner <lczerner@redhat.com> |
ext4: convert number of blocks to clusters properly We're using macro EXT4_B2C() to convert number of blocks to number of clusters for bigalloc file systems. However, we should be using EXT4_NUM_B2C(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
a0b30c12 |
|
09-Feb-2013 |
Theodore Ts'o <tytso@mit.edu> |
ext4: use module parameters instead of debugfs for mballoc_debug There are multiple reasons to move away from debugfs. First of all, we are only using it for a single parameter, and it is much more complicated to set up (some 30 lines of code compared to 3), and one more thing that might fail while loading the ext4 module. Secondly, as a module paramter it can be specified as a boot option if ext4 is built into the kernel, or as a parameter when the module is loaded, and it can also be manipulated dynamically under /sys/module/ext4/parameters/mballoc_debug. So it is more flexible. Ultimately we want to move away from using mb_debug() towards tracepoints, but for now this is still a useful simplification of the code base. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
40ae3487 |
|
04-Feb-2013 |
Theodore Ts'o <tytso@mit.edu> |
ext4: optimize mballoc for large allocations The ext4 block allocator only maintains buddy bitmaps for chunks which are less than or equal to one quarter of a block group. That is, for a file aystem with a 1k blocksize, and where the number of blocks in a block group is 8192 blocks, the largest chunk size tracked by buddy bitmaps is 2048 blocks. For a file system with a 4k blocksize, and where the number of blocks in a block group is 32768 blocks, the largest chunk size tracked by buddy bitmaps is 8192 blocks. To work around this code, mballoc.c before this commit would truncate allocation requests to the number of blocks in a block group minus 10. Why 10? Aside from being a completely arbitrary number, it avoids block allocation to be a power of two larger than 25% of the block group. If you try to explicitly fallocate 50% of the block group size, this will demonstrate the problem; the block allocation code will scan the all of the blocks in the file system with cr==0 (since the request is for a natural power of two), but then completely fail for all blocks groups, since the buddy bitmaps don't track chunk sizes of 50% of the block group. To fix this, in these we use ext4_mb_complex_scan_group() instead of ext4_mb_simple_scan_group(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger@dilger.ca>
|
#
f1167009 |
|
01-Feb-2013 |
Niu Yawei <yawei.niu@gmail.com> |
ext4: fix race in ext4_mb_add_n_trim() In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when changing the lg_prealloc_list. Signed-off-by: Niu Yawei <yawei.niu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
d71c1ae2 |
|
08-Nov-2012 |
Lukas Czerner <lczerner@redhat.com> |
ext4: warn when discard request fails other than EOPNOTSUPP We should warn user then the discard request fails. However we need to exclude -EOPNOTSUPP case since parts of the device might not support it while other parts can. So print the kernel warning when the error != -EOPNOTSUPP is returned from ext4_issue_discard(). We should also handle error cases in batched discard, again excluding EOPNOTSUPP. Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
d8ec0c39 |
|
07-Nov-2012 |
Alan Cox <alan@linux.intel.com> |
ext4: remove unused assignment Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
6d138ced |
|
08-Nov-2012 |
Eric Sandeen <sandeen@redhat.com> |
ext4: fix awful goto in ext4_mb_new_blocks() I think the whole function could be made prettier, but that goto really took the cake for too-clever-by-half. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5de35e8d |
|
22-Oct-2012 |
Lukas Czerner <lczerner@redhat.com> |
ext4: Avoid underflow in ext4_trim_fs() Currently if len argument in ext4_trim_fs() is smaller than one block, the 'end' variable underflow. Avoid that by returning EINVAL if len is smaller than file system block. Also remove useless unlikely(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
|
#
79f1ba49 |
|
21-Oct-2012 |
Tao Ma <boyu.mt@taobao.com> |
ext4: Checksum the block bitmap properly with bigalloc enabled In mke2fs, we only checksum the whole bitmap block and it is right. While in the kernel, we use EXT4_BLOCKS_PER_GROUP to indicate the size of the checksumed bitmap which is wrong when we enable bigalloc. The right size should be EXT4_CLUSTERS_PER_GROUP and this patch fixes it. Also as every caller of ext4_block_bitmap_csum_set and ext4_block_bitmap_csum_verify pass in EXT4_BLOCKS_PER_GROUP(sb)/8, we'd better removes this parameter and sets it in the function itself. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org
|
#
aaf7d73e |
|
26-Sep-2012 |
Lukas Czerner <lczerner@redhat.com> |
ext4: enable FITRIM ioctl on bigalloc file system With a minor tweaks regarding minimum extent size to discard and discarded bytes reporting the FITRIM can be enabled on bigalloc file system and it works without any problem. This patch fixes minlen handling and discarded bytes reporting to take into consideration bigalloc enabled file systems and finally removes the restriction and allow FITRIM to be used on file system with bigalloc feature enabled. Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
85556c9a |
|
26-Sep-2012 |
Wei Yongjun <yongjun_wei@trendmicro.com.cn> |
ext4: use kmem_cache_zalloc instead of kmem_cache_alloc/memset Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset(). spatch with a semantic match is used to found this problem. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
838cd0cf |
|
23-Sep-2012 |
Yongqiang Yang <xiaoqiangnk@gmail.com> |
ext4: check free block counters in ext4_mb_find_by_goal Free block counters should be checked before doing allocation. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
b5e2368b |
|
18-Sep-2012 |
Theodore Ts'o <tytso@mit.edu> |
ext4: re-enable -o discard functionality in no-journal mode This is a revert of commit b56ff9d397ce, which removed the call to ext4_issue_discard() to fix a BUG reported because ext4_issue_discard() was being called from inside a block group spinlock. As it turns out this bug had already been fixed by Lukas Czerner in commit 53fdcf992d61 by the simple expedient of moving when we call ext4_issue_discard() outside the spinlock. So it should be safe to re-enable this functionality, which I tested by putting an BUG_ON(in_atomic) just after the restored callsite to ext4_issue_discard(). Addresses-Google-Bug: #6750518 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Anatol Pomozov <anatol.pomozov@gmail.com>
|
#
28623c2f |
|
04-Sep-2012 |
Theodore Ts'o <tytso@mit.edu> |
ext4: grow the s_group_info array as needed Previously we allocated the s_group_info array with enough space for any future possible growth of the file system via online resize. This is unfortunate because it wastes memory, and it doesn't work for the meta_bg scheme, since there is no limit based on the number of reserved gdt blocks. So add the code to grow the s_group_info array as needed. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
4907cb7b |
|
01-Sep-2012 |
Anatol Pomozov <anatol.pomozov@gmail.com> |
treewide: fix comment/printk/variable typos Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
15c006a2 |
|
17-Aug-2012 |
Robin Dong <sanbai@taobao.com> |
ext4: remove unused function argument 'order' in mb_find_extent() All the routines call mb_find_extent are setting argument 'order' to 0 just like: mb_find_extent(e4b, 0, ex.fe_start, ex.fe_len, &ex); therefore the useless argument should be removed. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
01fc48e8 |
|
17-Aug-2012 |
Theodore Ts'o <tytso@mit.edu> |
ext4: don't load the block bitmap for block groups which have no space Add a short circuit check to ext4_mb_group_group() so that we don't bother to load the block bitmap for a block group which does not have any space available. (Or which does not have enough space until we are in desperation mode, i.e., when cr == 3.) Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741 Reported-by: mirek@me.com Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
97a74068 |
|
22-Jul-2012 |
Jan Kara <jack@suse.cz> |
ext4: remove useless marking of superblock dirty Commit a0375156 properly notes that superblock doesn't need to be marked as dirty when only number of free inodes / blocks / number of directories changes since that is recomputed on each mount anyway. However that comment leaves some unnecessary markings as dirty in place. Remove these. Artem: tested using xfstests for both journalled and non-journalled ext4. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
|
#
62a1391d |
|
09-Jul-2012 |
Haibo Liu <HaiboLiu6@gmai.com> |
ext4: remove an unused statement in ext4_mb_get_buddy_page_lock() In this patch, the statement "poff = block % blocks_per_page" in ext4_mb_get_buddy_page_lock has no effect. It will be optimized out by the compiler, but it's better to remove it. Signed-off-by: Haibo Liu <HaiboLiu6@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
1c8457ca |
|
30-Jun-2012 |
Aditya Kali <adityakali@google.com> |
ext4: avoid uneeded calls to ext4_mb_load_buddy() while reading mb_groups Currently ext4_mb_load_buddy is called for every group, irrespective of whether the group info is already in memory, while reading /proc/fs/ext4/<partition>/mb_groups proc file. For the purpose of mb_groups proc file, it is unnecessary to load the file group info from disk if it was loaded in past. These calls to ext4_mb_load_buddy make reading the mb_groups proc file expensive. Also, the locks around ext4_get_group_info are not required. This patch modifies the code to call ext4_mb_load_buddy only if the group info had never been loaded into memory in past. It also removes the mb group locking around ext4_get_group_info call. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
95599968 |
|
31-May-2012 |
Salman Qazi <sqazi@google.com> |
ext4: remove mb_groups before tearing down the buddy_cache We can't have references held on pages in the s_buddy_cache while we are trying to truncate its pages and put the inode. All the pages must be gone before we reach clear_inode. This can only be gauranteed if we can prevent new users from grabbing references to s_buddy_cache's pages. The original bug can be reproduced and the bug fix can be verified by: while true; do mount -t ext4 /dev/ram0 /export/hda3/ram0; \ umount /export/hda3/ram0; done & while true; do cat /proc/fs/ext4/ram0/mb_groups; done Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
02b78310 |
|
31-May-2012 |
Salman Qazi <sqazi@google.com> |
ext4: add ext4_mb_unload_buddy in the error path ext4_free_blocks fails to pair an ext4_mb_load_buddy with a matching ext4_mb_unload_buddy when it fails a memory allocation. Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
400db9d3 |
|
28-May-2012 |
Zheng Liu <gnehzuil.liu@gmail.com> |
ext4: cleanup in ext4_discard_allocated_blocks() remove 'len' variable in ext4_discard_allocated_blocks() because it is useless. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
9d99012f |
|
28-May-2012 |
Akira Fujita <a-fujita@rs.jp.nec.com> |
ext4: remove needs_recovery in ext4_mb_init() needs_recovery in ext4_mb_init() is not used, remove it. Signed-off-by: Akira Fujita <a-fujita@rs.jp.ne.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
feb0ab32 |
|
29-Apr-2012 |
Darrick J. Wong <djwong@us.ibm.com> |
ext4: make block group checksums use metadata_csum algorithm metadata_csum supersedes uninit_bg. Convert the ROCOMPAT uninit_bg flag check to a helper function that covers both, and make the checksum calculation algorithm use either crc16 or the metadata_csum chosen algorithm depending on which flag is set. Print a warning if we try to mount a filesystem with both feature flags set. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
fa77dcfa |
|
29-Apr-2012 |
Darrick J. Wong <djwong@us.ibm.com> |
ext4: calculate and verify block bitmap checksum Compute and verify the checksum of the block bitmap; this checksum is stored in the block group descriptor. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a7967f05 |
|
21-Mar-2012 |
Lukas Czerner <lczerner@redhat.com> |
ext4: always set then trimmed blocks count into len Currently if the range to trim is too small, for example on 1K fs the request to trim the first block, then the 'range->len' is not set reporting wrong number of discarded block to the caller. Fix this by always setting the 'range->len' before we return. Note that when there is a failure (-EINVAL) caller can not depend on 'range->len' being set properly. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
21e7fd22 |
|
21-Mar-2012 |
Lukas Czerner <lczerner@redhat.com> |
ext4: fix trimmed block count accunting Currently when there is not enough free blocks in the block group to discard (grp->bb_free < minlen) the 'trimmed' is bumped up anyway with the number of discarded blocks from the previous iteration. Fix this by bumping up 'trimmed' only if the ext4_trim_all_free() was actually run. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
913eed83 |
|
21-Mar-2012 |
Lukas Czerner <lczerner@redhat.com> |
ext4: fix start and len arguments handling in ext4_trim_fs() The overflow can happen when we are calling get_group_no_and_offset() which stores the group number in the ext4_grpblk_t type which is actually int. However when the blocknr is big enough the group number might be bigger than ext4_grpblk_t resulting in overflow. This will most likely happen with FITRIM default argument len = ULLONG_MAX. Fix this by using "end" variable instead of "start+len" as it is easier to get right and specifically check that the end is not beyond the end of the file system, so we are sure that the result of get_group_no_and_offset() will not overflow. Otherwise truncate it to the size of the file system. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
1084f252 |
|
19-Mar-2012 |
Theodore Ts'o <tytso@mit.edu> |
ext4: remove trailing newlines from ext4_msg() and ext4_error() messages The functions ext4_msg() and ext4_error() already tack on a trailing newline, so remove the unnecessary extra newline. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7f6a11e7 |
|
19-Mar-2012 |
Joe Perches <joe@perches.com> |
ext4: remove redundant "EXT4-fs: " from uses of ext4_msg ext4_msg adds "EXT4-fs: " to the messsage output. Remove the redundant bits from uses. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
c5e8f3f3 |
|
20-Feb-2012 |
Theodore Ts'o <tytso@mit.edu> |
ext4: remove EXT4_MB_{BITMAP,BUDDY} macros The EXT4_MB_BITMAP and EXT4_MB_BUDDY macros obfuscate more than they provide any abstraction. So remove them. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
1592d2c5 |
|
20-Feb-2012 |
Curt Wohlgemuth <curtw@google.com> |
ext4: mark possibly unused variable in ext4_mb_normalize_request() The 'orig_size' local variable is only used in a call to mb_debug(). Mark it with '__maybe_unused'. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
18aadd47 |
|
20-Feb-2012 |
Bobi Jam <bobijam@whamcloud.com> |
ext4: expand commit callback and The per-commit callback was used by mballoc code to manage free space bitmaps after deleted blocks have been released. This patch expands it to support multiple different callbacks, to allow other things to be done after the commit has been completed. Signed-off-by: Bobi Jam <bobijam@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
813e5727 |
|
20-Feb-2012 |
Theodore Ts'o <tytso@mit.edu> |
ext4: fix race when setting bitmap_uptodate flag In ext4_read_{inode,block}_bitmap() we were setting bitmap_uptodate() before submitting the buffer for read. The is bad, since we check bitmap_uptodate() without locking the buffer, and so if another process is racing with us, it's possible that they will think the bitmap is uptodate even though the read has not completed yet, resulting in inodes and blocks potentially getting allocated more than once if we get really unlucky. Addresses-Google-Bug: 2828254 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
60e07cf5 |
|
18-Dec-2011 |
Yongqiang Yang <xiaoqiangnk@gmail.com> |
ext4: do not reference pa_inode from group_pa pa_inode in group_pa is set NULL in ext4_mb_new_group_pa, so pa_inode should be not referenced. Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0a10da73 |
|
26-Oct-2011 |
Robin Dong <sanbai@taobao.com> |
ext4: fix a wrong comment in __mb_check_buddy() The comment says the bit should be 0, but the after code assert the bit to be 1. This makes people confused, so fix it. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
b051d8dc |
|
26-Oct-2011 |
Robin Dong <sanbai@taobao.com> |
ext4: remove unused variable in mb_find_extent() The variable 'ord' in function mb_find_extent() is redundant, so remove it. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
66a83cde |
|
26-Oct-2011 |
Robin Dong <sanbai@taobao.com> |
ext4: remove unused variable in ext4_mb_generate_from_pa() The variable 'count' in function ext4_mb_generate_from_pa() looks useless, so remove it. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
ebbe0277 |
|
26-Oct-2011 |
Robin Dong <sanbai@taobao.com> |
ext4: use stream-alloc when mb_group_prealloc set to zero The kernel will crash on ext4_mb_mark_diskspace_used: BUG_ON(ac->ac_b_ex.fe_len <= 0); after we set /sys/fs/ext4/sda/mb_group_prealloc to zero and create new files in an ext4 filesystem. The reason is: ac_b_ex.fe_len also set to zero(mb_group_prealloc) in ext4_mb_normalize_group_request because the ac_flags contains EXT4_MB_HINT_GROUP_ALLOC. I think when someone set mb_group_prealloc to zero, it means DO NOT USE GROUP PREALLOCATION, so we should set alloc-strategy to STREAM in this case. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
45dc63e7 |
|
20-Oct-2011 |
Dmitry Monakhov <dmonakhov@openvz.org> |
ext4: Allow quota file use root reservation Quota file is fs's metadata, so it is reasonable to permit use root resevation if necessary. This patch fix 265'th xfstest failure Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7aa0baea |
|
06-Oct-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: Free resources in ext4_mb_init()'s error paths In commit 79a77c5ac, we move ext4_mb_init_backend after the allocation of s_locality_group to avoid memory leak in error path, but there are still some other error paths in ext4_mb_init that need to do the same work. So this patch adds all the error patch for ext4_mb_init. And all the pointers are reset to NULL in case the caller may double free them. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e7d5f315 |
|
09-Sep-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters() This function really claims a number of free clusters, not blocks, so rename it so it's clearer what's going on. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
cff1dfd7 |
|
09-Sep-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: rename ext4_free_blocks_after_init() to ext4_free_clusters_after_init() This function really returns the number of clusters after initializing an uninitalized block bitmap has been initialized. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
021b65bb |
|
09-Sep-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Rename ext4_free_blks_{count,set}() to refer to clusters The field bg_free_blocks_count_{lo,high} in the block group descriptor has been repurposed to hold the number of free clusters for bigalloc functions. So rename the functions so it makes it easier to read and audit the block allocation and block freeing code. Note: at this point in bigalloc development we doesn't support online resize, so this also makes it really obvious all of the places we need to fix up to add support for online resize. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7b415bf6 |
|
09-Sep-2011 |
Aditya Kali <adityakali@google.com> |
ext4: Fix bigalloc quota accounting and i_blocks value With bigalloc changes, the i_blocks value was not correctly set (it was still set to number of blocks being used, but in case of bigalloc, we want i_blocks to represent the number of clusters being used). Since the quota subsystem sets the i_blocks value, this patch fixes the quota accounting and makes sure that the i_blocks value is set correctly. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
27baebb8 |
|
09-Sep-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: tune mballoc's default group prealloc size for bigalloc file systems The default group preallocation size had been previously set to 512 blocks/clusters, regardless of the block/cluster size. This is probably too big for large cluster sizes. So adjust the default so that it is 2 megabytes or 32 clusters, whichever is larger. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
24aaa8ef |
|
09-Sep-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: convert the free_blocks field in s_flex_groups to be free_clusters Convert the free_blocks to be free_clusters to make the final revised bigalloc changes easier to read/understand. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
57042651 |
|
09-Sep-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter Convert the percpu counters s_dirtyblocks_counter and s_freeblocks_counter in struct ext4_super_info to be s_dirtyclusters_counter and s_freeclusters_counter. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
84130193 |
|
09-Sep-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: teach ext4_free_blocks() about bigalloc and clusters The ext4_free_blocks() function now has two new flags that indicate whether a partial cluster at the beginning or the end of the block extents should be freed or not. That will be up the caller (i.e., truncate), who can figure out whether partial clusters at the beginning or the end of a block range can be freed. We also have to update the ext4_mb_free_metadata() and release_blocks_on_commit() machinery to be cluster-based, since it is used by ext4_free_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
53accfa9 |
|
09-Sep-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: teach mballoc preallocation code about bigalloc clusters In most of mballoc.c, we do everything in units of clusters, since the block allocation bitmaps and buddy bitmaps are all denominated in clusters. The one place where we do deal with absolute block numbers is in the code that handles the preallocation regions, since in the case of inode-based preallocation regions, the start of the preallocation region can't be relative to the beginning of the group. So this adds a bit of complexity, where pa_pstart and pa_lstart are block numbers, while pa_free, pa_len, and fe_len are denominated in units of clusters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7137d7a4 |
|
09-Sep-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: convert instances of EXT4_BLOCKS_PER_GROUP to EXT4_CLUSTERS_PER_GROUP Change the places in fs/ext4/mballoc.c where EXT4_BLOCKS_PER_GROUP are used to indicate the number of bits in a block bitmap (which is really a cluster allocation bitmap in bigalloc file systems). There are still some places in the ext4 codebase where usage of EXT4_BLOCKS_PER_GROUP needs to be audited/fixed, in code paths that aren't used given the initial restricted assumptions for bigalloc. These will need to be fixed before we can relax those restrictions. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
79a77c5a |
|
01-Aug-2011 |
Yu Jian <yujian@whamcloud.com> |
ext4: prevent memory leaks from ext4_mb_init_backend() on error path In ext4_mb_init(), if the s_locality_group allocation fails it will currently cause the allocations made in ext4_mb_init_backend() to be leaked. Moving the ext4_mb_init_backend() allocation after the s_locality_group allocation avoids that problem. Signed-off-by: Yu Jian <yujian@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
48e6061b |
|
01-Aug-2011 |
Yu Jian <yujian@whamcloud.com> |
ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode # Signed-off-by: Yu Jian <yujian@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
9d8b9ec4 |
|
01-Aug-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: use ext4_msg() instead of printk in mballoc Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
f18a5f21 |
|
01-Aug-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
c3e94d1d |
|
26-Jul-2011 |
Yongqiang Yang <xiaoqiangnk@gmail.com> |
ext4: let setup_new_group_blocks() set multiple bits at a time Rename mb_set_bits() to ext4_set_bits() and make it a global function so that setup_new_group_blocks() can use it. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
4740b830 |
|
26-Jul-2011 |
Yongqiang Yang <xiaoqiangnk@gmail.com> |
ext4: let ext4_group_add_blocks() handle 0 blocks quickly If ext4_group_add_blocks() is called with 0 block, make it return 0 without doing any extra work. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
cc7365df |
|
26-Jul-2011 |
Yongqiang Yang <xiaoqiangnk@gmail.com> |
ext4: let ext4_group_add_blocks() return an error code This patch lets ext4_group_add_blocks() return an error code if it fails, so that upper functions can handle error correctly. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0529155e |
|
26-Jul-2011 |
Yongqiang Yang <xiaoqiangnk@gmail.com> |
ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks() Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
ced156e4 |
|
23-Jul-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: don't increment s_mb_buddies_generated in ext4_mb_release In ext4_mb_release, we use s_mb_buddies_generated++. Although the output is OK, but I don't think we need this extra ++. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
529da704 |
|
23-Jul-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: remove unnecessary ext4_get_group_info in ext4_mb_load_buddy ext4_mb_load_buddy() calls ext4_get_group_info() for setting both "grp" and "e4b->bd_info", but it could do "e4b->bd_info = grp". Reported-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
d7a1fee1 |
|
17-Jul-2011 |
Dan Ehrenberg <dehrenberg@google.com> |
ext4: make the preallocation size be a multiple of stripe size Previously, if a stripe width was provided, then it would be used as the preallocation granularity, with no santiy checking and no way to override this. Now, mb_prealloc_size defaults to the smallest multiple of stripe size that is greater than or equal to the old default mb_prealloc_size, and this can be overridden with the sysfs interface. Signed-off-by: Dan Ehrenberg <dehrenberg@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
caaf7a29 |
|
11-Jul-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: Fix a double free of sbi->s_group_info in ext4_mb_init_backend If we meet with an error in ext4_mb_add_groupinfo, we kfree sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)], but fail to reset it to NULL. So the caller ext4_mb_init_backend will try to kfree it again and causes a double free. So fix it by resetting it to NULL. Some typo in comments of mballoc.c are also changed. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
823ba01f |
|
11-Jul-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: fix a race which could leak memory in ext4_groupinfo_create_slab() In ext4_groupinfo_create_slab, we create ext4_groupinfo_caches within ext4_grpinfo_slab_create_mutex, but set it outside the lock, and there does exist some case that we may create it twice and causes a memory leak. So set it before we call mutex_unlock. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
22612283 |
|
10-Jul-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: Change the wrong param comment for ext4_trim_all_free at ext4_trim_all_free() comment, there is no longer an @e4b parameter, instead it is @group. Reported-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
3d56b8d2 |
|
10-Jul-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: Speed up FITRIM by recording flags in ext4_group_info In ext4, when FITRIM is called every time, we iterate all the groups and do trim one by one. It is a bit time wasting if the group has been trimmed and there is no change since the last trim. So this patch adds a new flag in ext4_group_info->bb_state to indicate that the group has been trimmed, and it will be cleared if some blocks is freed(in release_blocks_on_commit). Another trim_minlen is added in ext4_sb_info to record the last minlen we use to trim the volume, so that if the caller provide a small one, we will go on the trim regardless of the bb_state. A simple test with my intel x25m ssd: df -h shows: /dev/sdb1 40G 21G 17G 56% /mnt/ext4 Block size: 4096 run the FITRIM with the following parameter: range.start = 0; range.len = UINT64_MAX; range.minlen = 1048576; without the patch: [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.505s user 0m0.000s sys 0m1.224s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.359s user 0m0.000s sys 0m1.178s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.228s user 0m0.000s sys 0m1.151s with the patch: [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.625s user 0m0.000s sys 0m1.269s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m0.002s user 0m0.000s sys 0m0.001s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m0.002s user 0m0.000s sys 0m0.001s A big improvement for the 2nd and 3rd run. Even after I delete some big image files, it is still much faster than iterating the whole disk. [root@boyu-tm test]# time ./ftrim /mnt/ext4/a real 0m1.217s user 0m0.000s sys 0m0.196s Cc: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
b3d4c2b1 |
|
10-Jul-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: Add new ext4 trim tracepoints Add ext4_trim_extent and ext4_trim_all_free. Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
169ddc3e |
|
10-Jul-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: speed up group trim with the right free block count When we trim some free blocks in a group of ext4, we need to calculate the free blocks properly and check whether there are enough freed blocks left for us to trim. Current solution will only calculate free spaces if they are large for a trim which isn't appropriate. Let us see a small example: a group has 1.5M free which are 300k, 300k, 300k, 300k, 300k. And minblocks is 1M. With current solution, we have to iterate the whole group since these 300k will never be subtracted from 1.5M. But actually we should exit after we find the first 2 free spaces since the left 3 chunks only sum up to 900K if we subtract the first 600K although they can't be trimed. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
22f10457 |
|
10-Jul-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: fix trim length underflow with small trim length In 0f0a25b, we adjust 'len' with s_first_data_block - start, but it could underflow in case blocksize=1K, fstrim_range.len=512 and fstrim_range.start = 0. In this case, when we run the code: len -= first_data_blk - start; len will be underflow to -1ULL. In the end, although we are safe that last_group check later will limit the trim to the whole volume, but that isn't what the user really want. So this patch fix it. It also adds the check for 'start' like ext3 so that we can break immediately if the start is invalid. Cc: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7132de74 |
|
10-Jul-2011 |
Maxim Patlasov <maxim.patlasov@gmail.com> |
ext4: fix i_blocks/quota accounting when extent insertion fails The current implementation of ext4_free_blocks() always calls dquot_free_block This looks quite sensible in the most cases: blocks to be freed are associated with inode and were accounted in quota and i_blocks some time ago. However, there is a case when blocks to free were not accounted by the time calling ext4_free_blocks() yet: 1. delalloc is on, write_begin pre-allocated some space in quota 2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks() 3. then ext4_ext_map_blocks() gets an error (e.g. ENOSPC) from ext4_ext_insert_extent() and calls ext4_free_blocks(). In this scenario, ext4_free_blocks() calls dquot_free_block() who, in turn, decrements i_blocks for blocks which were not accounted yet (due to delalloc) After clean umount, e2fsck reports something like: > Inode 21, i_blocks is 5080, should be 5128. Fix<y>? because i_blocks was erroneously decremented as explained above. The patch fixes the problem by passing the new flag EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request that the dquot_free_block() call be skipped. Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
9331b626 |
|
28-Jun-2011 |
Yongqiang Yang <xiaoqiangnk@gmail.com> |
ext4: quiet 'unused variables' compile warnings Unused variables was deleted. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a9c667f8 |
|
06-Jun-2011 |
Lukas Czerner <lczerner@redhat.com> |
ext4: fixed tracepoints cleanup While creating fixed tracepoints for ext3, basically by porting them from ext4, I found a lot of useless retyping, wrong type usage, useless variable passing and other inconsistencies in the ext4 fixed tracepoint code. This patch cleans the fixed tracepoint code for ext4 and also simplify some of them. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5def1360 |
|
05-Jun-2011 |
Yongqiang Yang <xiaoqiangnk@gmail.com> |
ext4: correct comments for ext4_free_blocks() metadata is not parameter of ext4_free_blocks() any more. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
55f020db |
|
25-May-2011 |
Allison Henderson <achender@linux.vnet.ibm.com> |
ext4: add flag to ext4_has_free_blocks This patch adds an allocation request flag to the ext4_has_free_blocks function which enables the use of reserved blocks. This will allow a punch hole to proceed even if the disk is full. Punching a hole may require additional blocks to first split the extents. Because ext4_has_free_blocks is a low level function, the flag needs to be passed down through several functions listed below: ext4_ext_insert_extent ext4_ext_create_new_leaf ext4_ext_grow_indepth ext4_ext_split ext4_ext_new_meta_block ext4_mb_new_blocks ext4_claim_free_blocks ext4_has_free_blocks [ext4 punch hole patch series 1/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>
|
#
28739eea |
|
24-May-2011 |
Lukas Czerner <lczerner@redhat.com> |
ext4: protect bb_first_free in ext4_trim_all_free() with group lock We should protect reading bd_info->bb_first_free with the group lock because otherwise we might miss some free blocks. This is not a big deal at all, but the change to do right thing is really simple, so lets do that. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
78944086 |
|
24-May-2011 |
Lukas Czerner <lczerner@redhat.com> |
ext4: only load buddy bitmap in ext4_trim_fs() when it is needed Currently we are loading buddy ext4_mb_load_buddy() for every block group we are going through in ext4_trim_fs() in many cases just to find out that there is not enough space to be bothered with. As Amir Goldstein suggested we can use bb_free information directly from ext4_group_info. This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and rather get the ext4_group_info via ext4_get_group_info() and use the bb_free information directly from that. This avoids unnecessary call to load buddy in the case the group does not have enough free space to trim. Loading buddy is now moved to ext4_trim_all_free(). Tested by me with xfstests 251. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
44183d42 |
|
09-May-2011 |
Amir Goldstein <amir73il@users.sf.net> |
ext4: remove alloc_semp After taking care of all group init races, all that remains is to remove alloc_semp from ext4_allocation_context and ext4_buddy structs. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
9b8b7d35 |
|
09-May-2011 |
Amir Goldstein <amir73il@users.sf.net> |
ext4: teach ext4_mb_init_cache() to skip uptodate buddy caches After online resize which adds new groups, some of the groups in a buddy page may be initialized and uptodate, while other (new ones) may be uninitialized. The indication for init of new block groups is when ext4_mb_init_cache() is called with an uptodate buddy page. In this case, initialized groups on that buddy page must be skipped when initializing the buddy cache. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
2de8807b |
|
09-May-2011 |
Amir Goldstein <amir73il@users.sf.net> |
ext4: synchronize ext4_mb_init_group() with buddy page lock The old routines ext4_mb_[get|put]_buddy_cache_lock(), which used to take grp->alloc_sem for all groups on the buddy page have been replaced with the routines ext4_mb_[get|put]_buddy_page_lock(). The new routines take both buddy and bitmap page locks to protect against concurrent init of groups on the same buddy page. The GROUP_NEED_INIT flag is tested again under page lock to check if the group was initialized by another caller. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e73a347b |
|
09-May-2011 |
Amir Goldstein <amir73il@users.sf.net> |
ext4: implement ext4_add_groupblocks() by freeing blocks The old imlementation used to take grp->alloc_sem and set the GROUP_NEED_INIT flag, so that the buddy cache would be reloaded. The new implementation updates the buddy cache by freeing the added blocks and making them available for use, so there is no need to reload the buddy cache and there is no need to take grp->alloc_sem. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
2cd05cc3 |
|
09-May-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: remove unneeded ext4_journal_get_undo_access The block allocation code used to use jbd2_journal_get_undo_access as a way to make changes that wouldn't show up until the commit took place. The new multi-block allocation code has a its own way of preventing newly freed blocks from getting reused until the commit takes place (it avoids updating the buddy bitmaps until the commit is done), so we don't need to use jbd2_journal_get_undo_access(), which has extra overhead compared to jbd2_journal_get_write_access(). There was one last vestigal use of ext4_journal_get_undo_access() in ext4_add_groupblocks(); change it to use ext4_journal_get_write_access() and then remove the ext4_journal_get_undo_access() support. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
2846e820 |
|
09-May-2011 |
Amir Goldstein <amir73il@users.sf.net> |
ext4: move ext4_add_groupblocks() to mballoc.c In preparation for the next patch, the function ext4_add_groupblocks() is moved to mballoc.c, where it could use some static functions. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
d9f34504 |
|
30-Apr-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: ignore errors when issuing discards This is an effective revert of commit a30eec2a8: "ext4: stop issuing discards if not supported by device". The problem is that there are some devices that may return errors in response to a discard request some times but not others. (One example would be a hybrid dm device which concatenates an SSD and an HDD device). By this logic, I also removed the error checking from ext4's FITRIM code; so that an error from a discard will not stop the FITRIM from trying to trim the rest of the file system. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
26626f11 |
|
16-Apr-2011 |
Yang Ruirui <ruirui.r.yang@tieto.com> |
ext4: release page cache in ext4_mb_load_buddy error path Add missing page_cache_release in the error path of ext4_mb_load_buddy Signed-off-by: Yang Ruirui <ruirui.r.yang@tieto.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
25985edc |
|
30-Mar-2011 |
Lucas De Marchi <lucas.demarchi@profusion.mobi> |
Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
|
#
0ba08517 |
|
23-Mar-2011 |
Tao Ma <boyu.mt@taobao.com> |
ext4: fix a BUG in mb_mark_used during trim. In a bs=4096 volume, if we call FITRIM with the following parameter as fstrim_range(start = 102400, len = 134144000, minlen = 10240), we will trigger this BUG_ON: BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3)); Mar 4 00:55:52 boyu-tm kernel: ------------[ cut here ]------------ Mar 4 00:55:52 boyu-tm kernel: kernel BUG at fs/ext4/mballoc.c:1506! Mar 4 01:21:09 boyu-tm kernel: Code: d4 00 00 00 00 49 89 fe 8b 56 0c 44 8b 7e 04 89 55 c4 48 8b 4f 28 89 d6 44 01 fe 48 63 d6 48 8b 41 18 48 c1 e0 03 48 39 c2 76 04 <0f> 0b eb fe 48 8b 55 b0 8b 47 34 3b 42 08 74 04 0f 0b eb fe 48 Mar 4 01:21:09 boyu-tm kernel: RIP [<ffffffffa053eb42>] mb_mark_used+0x47/0x26c [ext4] Mar 4 01:21:09 boyu-tm kernel: RSP <ffff880121e45c38> Mar 4 01:21:09 boyu-tm kernel: ---[ end trace 9f461696f6a9dcf2 ]--- Fix this bug by doing the accounting correctly. Cc: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
4596fe07 |
|
21-Mar-2011 |
Eric Sandeen <sandeen@redhat.com> |
ext4: don't kfree uninitialized s_group_info members We can call kfree on uninitialized members of the s_group_info array on an the error path. We can avoid this by kzalloc'ing the array. This doesn't entirely solve the oops on mount if we fail down this path; failed_mount4: frees the sbi, for one, which gets referenced later in the failed mount paths - I haven't worked that out yet. https://bugzilla.kernel.org/show_bug.cgi?id=30872 Reported-by: Eugene A. Shatokhin <dame_eugene@mail.ru> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
4dd89fc6 |
|
27-Feb-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: suppress verbose debugging information if malloc-debug is off If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due to disk being full, a verbose debugging message is printed, even if the malloc-debug switch has not been enabled. Suppress the debugging message so that nothing is printed unless malloc-debug has been turned on. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5a54b2f1 |
|
24-Feb-2011 |
Coly Li <colyli@gmail.com> |
ext4: mballoc: don't replace the current preallocation group unnecessarily In ext4_mb_check_group_pa(), the current preallocation space is replaced with a new preallocation space when the two have the same distance from the goal block. This doesn't actually gain us anything, so change things so that the function only switches to the new preallocation group if its distance from the goal block is strictly smaller than the current preallocaiton group's distance from the goal block. Signed-off-by: Coly Li <bosong.ly@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7c786059 |
|
24-Feb-2011 |
Coly Li <i@coly.li> |
mballoc: add comments to ext4_mb_mark_free_simple() This patch adds comments to ext4_mb_mark_free_simple to make it more understandable. Signed-off-by: Coly Li <bosong.ly@taobao.com> Cc: Alex Tomas <alex@clusterfs.com> Cc: Theodore Tso <tytso@google.com>
|
#
235772da |
|
24-Feb-2011 |
Coly Li <i@coly.li> |
ext4: remove unncessary call mb_find_buddy() in debugging code In __mb_check_buddy(), look at the code below: 591 fstart = -1; 592 buddy = mb_find_buddy(e4b, 0, &max); 593 for (i = 0; i < max; i++) { 594 if (!mb_test_bit(i, buddy)) { 595 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free); 596 if (fstart == -1) { 597 fragments++; 598 fstart = i; 599 } 600 continue; 601 } 602 fstart = -1; 603 /* check used bits only */ 604 for (j = 0; j < e4b->bd_blkbits + 1; j++) { 605 buddy2 = mb_find_buddy(e4b, j, &max2); 606 k = i >> j; 607 MB_CHECK_ASSERT(k < max2); 608 MB_CHECK_ASSERT(mb_test_bit(k, buddy2)); 609 } 610 } 611 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info)); 612 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments); 613 614 grp = ext4_get_group_info(sb, e4b->bd_group); 615 buddy = mb_find_buddy(e4b, 0, &max); On line 592, buddy is fetched by mb_find_buddy() with order 0, between line 593 to line 615, buddy is not changed, therefore there is no need to fetch buddy again from mb_find_buddy() with order 0 again. We can safely remove the second mb_find_buddy() on line 615. Signed-off-by: Coly Li <bosong.ly@taobao.com> Cc: Alex Tomas <alex@clusterfs.com> Cc: Theodore Tso <tytso@google.com>
|
#
84b775a3 |
|
23-Feb-2011 |
Coly Li <i@coly.li> |
ext4: code cleanup in mb_find_buddy() Current code calculate max no matter whether order is zero, it's unnecessary. This cleanup patch sets max to "1 << (e4b->bd_blkbits + 3)" only when order == 0. Signed-off-by: Coly Li <bosong.ly@taobao.com> Cc: Alex Tomas <alex@clusterfs.com> Cc: Theodore Tso <tytso@google.com>
|
#
0b75a840 |
|
22-Feb-2011 |
Lukas Czerner <lczerner@redhat.com> |
ext4: mark file-local functions and variables as static Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
2892c15d |
|
12-Feb-2011 |
Eric Sandeen <sandeen@redhat.com> |
ext4: make grpinfo slab cache names static In 2.6.37 I was running into oopses with repeated module loads & unloads. I tracked this down to: fb1813f4 ext4: use dedicated slab caches for group_info structures (this was in addition to the features advert unload problem) The kstrdup & subsequent kfree of the cache name was causing a double free. In slub, at least, if I read it right it allocates & frees the name itself, slab seems to do something different... so in slub I think we were leaking -our- cachep->name, and double freeing the one allocated by slub. After getting lost in slab/slub/slob a bit, I just looked at other sized-caches that get allocated. jbd2, biovec, sgpool all do it more or less the way jbd2 does. Below patch follows the jbd2 method of dynamically allocating a cache at mount time from a list of static names. (This might also possibly fix a race creating the caches with parallel mounts running). [Folded in a fix from Dan Carpenter which fixed an off-by-one error in the original patch] Cc: stable@kernel.org Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0f0a25bf |
|
11-Jan-2011 |
Jan Kara <jack@suse.cz> |
ext4: fix trimming starting with block 0 with small blocksize When s_first_data_block is not zero (which happens e.g. when block size is 1KB) and trim ioctl is called to start trimming from block 0, the math in ext4_get_group_no_and_offset() overflows. The overall result is that ioctl returns EINVAL which is kind of unexpected and we probably don't want userspace tools to bother with internal details of filesystem structure. So just silently increase starting offset (and shorten length) when starting block is below s_first_data_block. CC: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0a2179b1 |
|
11-Jan-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: revert buggy trim overflow patch This reverts commit 4f531501e44: ext4: fix possible overflow in ext4_trim_fs() Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a5196f8c |
|
09-Jan-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: remove ext4_mb_return_to_preallocation() This function was never implemented, except for a BUG_ON which was tripping when ext4 is run without a journal. The problem is that although the comment asserts that "truncate (which is the only way to free block) discards all preallocations", ext4_free_blocks() is also called in various error recovery paths when blocks have been allocated, but for various reasons, we were not able to use those data blocks (for example, because we ran out of memory while trying to manipulate the extent tree, or some other similar situation). In addition to the fact that this function isn't implemented except for the incorrect BUG_ON, the single caller of this function, ext4_free_blocks(), doesn't use it all if the journal is enabled. So remove the (stub) function entirely for now. If we decide it's better to add it back, it's only going to be useful with a relatively large number of code changes anyway. Google-Bug-Id: 3236408 Cc: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
ca6e909f |
|
09-Jan-2011 |
Jan Kara <jack@suse.cz> |
ext4: fix trimming of a single group When ext4_trim_fs() is called to trim a part of a single group, the logic will wrongly set last block of the interval to 'len' instead of 'first_block + len'. Thus a shorter interval is possibly trimmed. Fix it. CC: Lukas Czerner <lczerner@redhat.com> Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
f2321097 |
|
09-Jan-2011 |
Theodore Ts'o <tytso@mit.edu> |
ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED Remove the short element i_delalloc_reserved_flag from the ext4_inode_info structure and replace it a new bit in i_state_flags. Since we have an ext4_inode_info for every ext4 inode cached in the inode cache, any savings we can produce here is a very good thing from a memory utilization perspective. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
93259636 |
|
09-Jan-2011 |
Lukas Czerner <lczerner@redhat.com> |
ext4: remove warning message from ext4_issue_discard helper ext4_issue_discard is supposed to be helper for calling discard, however in case that underlying device does not support discard it prints out the warning message and clears the DISCARD t_mount_opt flag. Since it can be (and is) used by others, it should not do anything and let the caller to handle the error case. This commit removes warning message and flag setting from ext4_issue_discard and use it just in place where it is really needed (release_blocks_on_commit). FITRIM ioctl should not set any flags nor it should print out warning messages, so get rid of the warning as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
|
#
4f531501 |
|
09-Jan-2011 |
Lukas Czerner <lczerner@redhat.com> |
ext4: fix possible overflow in ext4_trim_fs() When determining last group through ext4_get_group_no_and_offset() the result may be wrong in cases when range->start and range-len are too big, because it may overflow when summing up those two numbers. Fix that by checking range->len and limit its value to ext4_blocks_count(). This commit was tested by myself with expected result. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
|
#
b72143ab |
|
20-Dec-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Add error checking to kmem_cache_alloc() call in ext4_free_blocks() Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
fd8c37ec |
|
15-Dec-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Simplify the usage of clear_opt() and set_opt() macros Change clear_opt() and set_opt() to take a superblock pointer instead of a pointer to EXT4_SB(sb)->s_mount_opt. This makes it easier for us to support a second mount option field. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
b56ff9d3 |
|
08-Nov-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Don't call sb_issue_discard() in ext4_free_blocks() Commit 5c521830cf (ext4: Support discard requests when running in no-journal mode) attempts to add sb_issue_discard() for data blocks (in data=writeback mode) and in no-journal mode. Unfortunately, this no longer works, because in commit dd3932eddf (block: remove BLKDEV_IFL_WAIT), sb_issue_discard() only presents a synchronous interface, and there are times when we call ext4_free_blocks() when we are are holding a spinlock, or are otherwise in an atomic context. For now, I've removed the call to sb_issue_discard() to prevent a deadlock or (if spinlock debugging is enabled) failures like this: BUG: scheduling while atomic: rc.sysinit/1376/0x00000002 Pid: 1376, comm: rc.sysinit Not tainted 2.6.36-ARCH #1 Call Trace: [<ffffffff810397ce>] __schedule_bug+0x5e/0x70 [<ffffffff81403110>] schedule+0x950/0xa70 [<ffffffff81060bad>] ? insert_work+0x7d/0x90 [<ffffffff81060fbd>] ? queue_work_on+0x1d/0x30 [<ffffffff81061127>] ? queue_work+0x37/0x60 [<ffffffff8140377d>] schedule_timeout+0x21d/0x360 [<ffffffff812031c3>] ? generic_make_request+0x2c3/0x540 [<ffffffff81402680>] wait_for_common+0xc0/0x150 [<ffffffff81041490>] ? default_wake_function+0x0/0x10 [<ffffffff812034bc>] ? submit_bio+0x7c/0x100 [<ffffffff810680a0>] ? wake_bit_function+0x0/0x40 [<ffffffff814027b8>] wait_for_completion+0x18/0x20 [<ffffffff8120a969>] blkdev_issue_discard+0x1b9/0x210 [<ffffffff811ba03e>] ext4_free_blocks+0x68e/0xb60 [<ffffffff811b1650>] ? __ext4_handle_dirty_metadata+0x110/0x120 [<ffffffff811b098c>] ext4_ext_truncate+0x8cc/0xa70 [<ffffffff810d713e>] ? pagevec_lookup+0x1e/0x30 [<ffffffff81191618>] ext4_truncate+0x178/0x5d0 [<ffffffff810eacbb>] ? unmap_mapping_range+0xab/0x280 [<ffffffff810d8976>] vmtruncate+0x56/0x70 [<ffffffff811925cb>] ext4_setattr+0x14b/0x460 [<ffffffff811319e4>] notify_change+0x194/0x380 [<ffffffff81117f80>] do_truncate+0x60/0x90 [<ffffffff811e08fa>] ? security_inode_permission+0x1a/0x20 [<ffffffff811eaec1>] ? tomoyo_path_truncate+0x11/0x20 [<ffffffff81127539>] do_last+0x5d9/0x770 [<ffffffff811278bd>] do_filp_open+0x1ed/0x680 [<ffffffff8140644f>] ? page_fault+0x1f/0x30 [<ffffffff81132bfc>] ? alloc_fd+0xec/0x140 [<ffffffff81118db1>] do_sys_open+0x61/0x120 [<ffffffff81118e8b>] sys_open+0x1b/0x20 [<ffffffff81002e6b>] system_call_fastpath+0x16/0x1b https://bugzilla.kernel.org/show_bug.cgi?id=22302 Reported-by: Mathias Burén <mathias.buren@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: jiayingz@google.com
|
#
eee4adc7 |
|
27-Oct-2010 |
Eric Sandeen <sandeen@redhat.com> |
ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static These functions are only used within fs/ext4/mballoc.c, so move them so they are used after they are defined, and then make them be static. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5dabfc78 |
|
27-Oct-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() This is a cleanup to avoid namespace leaks out of fs/ext4 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7360d173 |
|
27-Oct-2010 |
Lukas Czerner <lczerner@redhat.com> |
ext4: Add batched discard support for ext4 Walk through allocation groups and trim all free extents. It can be invoked through FITRIM ioctl on the file system. The main idea is to provide a way to trim the whole file system if needed, since some SSD's may suffer from performance loss after the whole device was filled (it does not mean that fs is full!). It search for free extents in allocation groups specified by Byte range start -> start+len. When the free extent is within this range, blocks are marked as used and then trimmed. Afterwards these blocks are marked as free in per-group bitmap. Since fstrim is a long operation it is good to have an ability to interrupt it by a signal. This was added by Dmitry Monakhov. Thanks Dimitry. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
77ca6cdf |
|
27-Oct-2010 |
Lukas Czerner <lczerner@redhat.com> |
ext4: Use return value from sb_issue_discard() Use return value from sb_issue_discard() as return value in ext4_issue_discard(). Since sb_issue_discard() may result in more serious errors than just -EOPNOTSUPP it is worth to inform user of this function about them to handle error cases properly. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
87783690 |
|
27-Oct-2010 |
Namhyung Kim <namhyung@gmail.com> |
ext4: Check return value of sb_getblk() and friends Fail block allocation if sb_getblk() returns NULL. In that case, sb_find_get_block() also likely to fail so that it should skip calling ext4_forget(). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
16828088 |
|
27-Oct-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: use KMEM_CACHE instead of kmem_cache_create Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem cache. This slab tends to be fairly static, so it shouldn't be marked as likely to have free pages that can be reclaimed. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
3e1e5f50 |
|
27-Oct-2010 |
Eric Sandeen <sandeen@redhat.com> |
ext4: don't use ext4_allocation_contexts for tracing Many tracepoints were populating an ext4_allocation_context to pass in, but this requires a slab allocation even when tracepoints are off. In fact, 4 of 5 of these allocations were only for tracing. In addition, we were only using a small fraction of the 144 bytes of this structure for this purpose. We can do away with all these alloc/frees of the ac and simply pass in the bits we care about, instead. I tested this by turning on tracing and running through xfstests on x86_64. I did not actually do anything with the trace output, however. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
53fdcf99 |
|
27-Oct-2010 |
Lukas Czerner <lczerner@redhat.com> |
ext4: don't hold spinlock while calling ext4_issue_discard() We can't hold the block group spinlock because we ext4_issue_discard() calls wait and hence can get rescheduled. Google-Bug-Id: 3017678 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
58298709 |
|
27-Oct-2010 |
Lukas Czerner <lczerner@redhat.com> |
ext4: check for negative error code from sb_issue_discard sb_issue_discard() is returning negative error code, so check for -EOPNOTSUPP. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
fb1813f4 |
|
27-Oct-2010 |
Curt Wohlgemuth <curtw@google.com> |
ext4: use dedicated slab caches for group_info structures ext4_group_info structures are currently allocated with kmalloc(). With a typical 4K block size, these are 136 bytes each -- meaning they'll each consume a 256-byte slab object. On a system with many ext4 large partitions, that's a lot of wasted kernel slab space. (E.g., a single 1TB partition will have about 8000 block groups, using about 2MB of slab, of which nearly 1MB is wasted.) This patch creates an array of slab pointers created as needed -- depending on the superblock block size -- and uses these slabs to allocate the group info objects. Google-Bug-Id: 2980809 Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
85fe4025 |
|
23-Oct-2010 |
Christoph Hellwig <hch@lst.de> |
fs: do not assign default i_ino in new_inode Instead of always assigning an increasing inode number in new_inode move the call to assign it into those callers that actually need it. For now callers that need it is estimated conservatively, that is the call is added to all filesystems that do not assign an i_ino by themselves. For a few more filesystems we can avoid assigning any inode number given that they aren't user visible, and for others it could be done lazily when an inode number is actually needed, but that's left for later patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
dd3932ed |
|
16-Sep-2010 |
Christoph Hellwig <hch@lst.de> |
block: remove BLKDEV_IFL_WAIT All the blkdev_issue_* helpers can only sanely be used for synchronous caller. To issue cache flushes or barriers asynchronously the caller needs to set up a bio by itself with a completion callback to move the asynchronous state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always specified when calling blkdev_issue_* and also remove the now unused flags argument to blkdev_issue_flush and blkdev_issue_zeroout. For blkdev_issue_discard we need to keep it for the secure discard flag, which gains a more descriptive name and loses the bitops vs flag confusion. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
61002f7db |
|
18-Aug-2010 |
Christoph Hellwig <hch@infradead.org> |
ext4: do not send discards as barriers ext4 already uses synchronous discards, no need to add I/O barriers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
2cf6d26a |
|
18-Aug-2010 |
Christoph Hellwig <hch@infradead.org> |
block: pass gfp_mask and flags to sb_issue_discard We'll need to get rid of the BLKDEV_IFL_BARRIER flag, and to facilitate that and to make the interface less confusing pass all flags explicitly. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
6c7a120a |
|
05-Aug-2010 |
Aditya Kali <adityakali@google.com> |
ext4: Adding error check after calling ext4_mb_regular_allocator() If the bitmap block on disk is bad, ext4_mb_load_buddy() returns an error. This error is returned to the caller, ext4_mb_regular_allocator() and then to ext4_mb_new_blocks(). But ext4_mb_new_blocks() did not check for the return value of ext4_mb_regular_allocator() and would repeatedly try to load the bitmap block. The fix simply catches the return value and exits out of the 'repeat' loop after cleanup. We also take the opportunity to clean up the error handling in ext4_mb_new_blocks(). Google-Bug-Id: 2853530 Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
73b2c716 |
|
30-Jul-2010 |
Uwe Kleine-König <u.kleine-koenig@pengutronix.de> |
fix comment typo "choosed" -> "chosen" Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
e3570639 |
|
27-Jul-2010 |
Eric Sandeen <sandeen@redhat.com> |
ext4: don't print scary messages for allocation failures post-abort I often get emails containing the "This should not happen!!" message, conveniently trimmed to remove things like: sd 0:0:0:0: [sda] Unhandled error code sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 03 13 c9 70 00 00 28 00 end_request: I/O error, dev sda, sector 51628400 Aborting journal on device dm-0-8. EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal EXT4-fs (dm-0): Remounting filesystem read-only I don't think there is any value to the verbosity if the reason is due to a filesystem abort; it just obfuscates the root cause. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
506bf2d8 |
|
27-Jul-2010 |
Eric Sandeen <sandeen@redhat.com> |
ext4: allocate stripe-multiple IOs on stripe boundaries For some reason, today mballoc only allocates IOs which are exactly stripe-sized on a stripe boundary. If you have a multiple (say, a 128k IO on a 64k stripe) you may end up unaligned. It seems to me that a simple change to align stripe-multiple IOs on stripe boundaries would be a very good idea, unless this breaks some other mballoc heuristic for some reason... Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5c521830 |
|
27-Jul-2010 |
Jiaying Zhang <jiayingz@google.com> |
ext4: Support discard requests when running in no-journal mode Issue discard request in ext4_free_blocks() when ext4 has no journal and is mounted with discard option. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a271fe85 |
|
27-Jul-2010 |
Joe Perches <joe@perches.com> |
ext4: Remove unnecessary casts of private_data Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e5880d76 |
|
27-Jul-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: fix potential NULL dereference while tracing The allocation_context pointer can be NULL. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e29136f8 |
|
28-Jun-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Enhance ext4_grp_locked_error() to take block and function numbers Also use a macro definition so that __func__ and __LINE__ is implicit. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5a0790c2 |
|
14-Jun-2010 |
Andi Kleen <andi@firstfloor.org> |
ext4: remove initialized but not read variables No real bugs found, just removed some dead code. Found by gcc 4.6's new warnings. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a0375156 |
|
11-Jun-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Clean up s_dirt handling We don't need to set s_dirt in most of the ext4 code when journaling is enabled. In ext3/4 some of the summary statistics for # of free inodes, blocks, and directories are calculated from the per-block group statistics when the file system is mounted or unmounted. As a result the superblock doesn't have to be updated, either via the journal or by setting s_dirt. There are a few exceptions, most notably when resizing the file system, where the superblock needs to be modified --- and in that case it should be done as a journalled operation if possible, and s_dirt set only in no-journal mode. This patch will optimize out some unneeded disk writes when using ext4 with a journal. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
60e6679e |
|
17-May-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Drop whitespace at end of lines This patch was generated using: #!/usr/bin/perl -i while (<>) { s/[ ]+$//; print; } Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
f307333e |
|
17-May-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Add new tracepoints to track mballoc's buddy bitmap loads Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
12e9b892 |
|
16-May-2010 |
Dmitry Monakhov <dmonakhov@openvz.org> |
ext4: Use bitops to read/modify i_flags in struct ext4_inode_info At several places we modify EXT4_I(inode)->i_flags without holding i_mutex (ext4_do_update_inode, ...). These modifications are racy and we can lose updates to i_flags. So convert handling of i_flags to use bitops which are atomic. https://bugzilla.kernel.org/show_bug.cgi?id=15792 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
291dae47 |
|
16-May-2010 |
Curt Wohlgemuth <curtw@google.com> |
ext4: Fix for ext4_mb_collect_stats() Fix ext4_mb_collect_stats() to use the correct test for s_bal_success; it should be testing "best-extent.fe_len >= orig-extent.fe_len" , not "orig-extent.fe_len >= goal-extent.fe_len" . Signed-off-by: Curt Wohlgemuth <curtw@google.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
8a57d9d6 |
|
16-May-2010 |
Curt Wohlgemuth <curtw@google.com> |
ext4: check for a good block group before loading buddy pages This adds a new field in ext4_group_info to cache the largest available block range in a block group; and don't load the buddy pages until *after* we've done a sanity check on the block group. With large allocation requests (e.g., fallocate(), 8MiB) and relatively full partitions, it's easy to have no block groups with a block extent large enough to satisfy the input request length. This currently causes the loop during cr == 0 in ext4_mb_regular_allocator() to load the buddy bitmap pages for EVERY block group. That can be a lot of pages. The patch below allows us to call ext4_mb_good_group() BEFORE we load the buddy pages (although we have check again after we lock the block group). Addresses-Google-Bug: #2578108 Addresses-Google-Bug: #2704453 Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a30eec2a |
|
16-May-2010 |
Eric Sandeen <sandeen@redhat.com> |
ext4: stop issuing discards if not supported by device Turn off issuance of discard requests if the device does not support it - similar to the action we take for barriers. This will save a little computation time if a non-discardable device is mounted with -o discard, and also makes it obvious that it's not doing what was asked at mount time ... Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e39e07fd |
|
13-May-2010 |
Jing Zhang <zj.barak@gmail.com> |
ext4: rename ext4_mb_release_desc() to ext4_mb_unload_buddy() This function cleans up after ext4_mb_load_buddy(), so the renaming makes the code clearer. Signed-off-by: Jing Zhang <zj.barak@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
62e823a2 |
|
12-May-2010 |
Jing Zhang <zj.barak@gmail.com> |
ext4: Remove unnecessary call to ext4_get_group_desc() in mballoc Signed-off-by: Jing Zhang <zj.barak@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
b90f6870 |
|
20-Apr-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Issue the discard operation *before* releasing the blocks to be reused Otherwise, we can end up having data corruption because the blocks could get reused and then discarded! https://bugzilla.kernel.org/show_bug.cgi?id=15579 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
64e290ec |
|
04-Mar-2010 |
Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> |
ext4: fix up rb_root initializations to use RB_ROOT ext4 uses rb_node = NULL; to zero rb_root at few places. Using RB_ROOT as the initializer is more portable in case the underlying implementation of rbtrees changes in the future. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Eric Paris <eparis@redhat.com>
|
#
5dd4056d |
|
03-Mar-2010 |
Christoph Hellwig <hch@infradead.org> |
dquot: cleanup space allocation / freeing routines Get rid of the alloc_space, free_space, reserve_space, claim_space and release_rsv dquot operations - they are always called from the filesystem and if a filesystem really needs their own (which none currently does) it can just call into it's own routine directly. Move shared logic into the common __dquot_alloc_space, dquot_claim_space_nodirty and __dquot_free_space low-level methods, and rationalize the wrappers around it to move as much as possible code into the common block for CONFIG_QUOTA vs not. Also rename all these helpers to be named dquot_* instead of vfs_dq_*. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
bda00de7 |
|
03-Mar-2010 |
Akinobu Mita <akinobu.mita@gmail.com> |
ext4: cleanup to use ext4_grp_offs_to_block() More cleanup to convert open-coded calculations of the first block number of a free extent to use ext4_grp_offs_to_block() instead. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger@sun.com>
|
#
5661bd68 |
|
03-Mar-2010 |
Akinobu Mita <akinobu.mita@gmail.com> |
ext4: cleanup to use ext4_group_first_block_no() This is a cleanup and simplification patch which takes some open-coded calculations to calculate the first block number of a group and converts them to use the (already defined) ext4_group_first_block_no() function. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger@sun.com>
|
#
cc483f10 |
|
01-Mar-2010 |
Tao Ma <tao.ma@oracle.com> |
ext4: Fix fencepost error in chosing choosing group vs file preallocation. The ext4 multiblock allocator decides whether to use group or file preallocation based on the file size. When the file size reaches s_mb_stream_request (default is 16 blocks), it changes to use a file-specific preallocation. This is cool, but it has a tiny problem. See a simple script: mkfs.ext4 -b 1024 /dev/sda8 1000000 mount -t ext4 -o nodelalloc /dev/sda8 /mnt/ext4 for((i=0;i<5;i++)) do cat /mnt/4096>>/mnt/ext4/a #4096 is a file with 4096 characters. cat /mnt/4096>>/mnt/ext4/b done debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1 And you get BLOCKS: (0-14):8705-8719, (15):2356, (16-19):8465-8468 So there are 3 extents, a bit strange for the lonely 15th logical block. As we write to the 16 blocks, we choose file preallocation in ext4_mb_group_or_file, but in ext4_mb_normalize_request, we meet with the 16*1024 range, so no preallocation will be carried. file b then reserves the space after '2356', so when when write 16, we start from another part. This patch just change the check in ext4_mb_group_or_file, so that for the lonely 15 we will still use group preallocation. After the patch, we will get: debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1 BLOCKS: (0-15):8705-8720, (16-19):8465-8468 Looks more sane. Thanks. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
12062ddd |
|
15-Feb-2010 |
Eric Sandeen <sandeen@redhat.com> |
ext4: move __func__ into a macro for ext4_warning, ext4_error Just a pet peeve of mine; we had a mishash of calls with either __func__ or "function_name" and the latter tends to get out of sync. I think it's easier to just hide the __func__ in a macro, and it'll be consistent from then on. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
1537a363 |
|
29-Jan-2010 |
Daniel Mack <daniel@caiaq.de> |
tree-wide: fix 'lenght' typo in comments and code Some misspelled occurences of 'octet' and some comments were also fixed as I was on it. Signed-off-by: Daniel Mack <daniel@caiaq.de> Cc: Jiri Kosina <trivial@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Junio C Hamano <gitster@pobox.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
d21cd8f1 |
|
09-Dec-2009 |
Dmitry Monakhov <dmonakhov@openvz.org> |
ext4: Fix potential quota deadlock We have to delay vfs_dq_claim_space() until allocation context destruction. Currently we have following call-trace: ext4_mb_new_blocks() /* task is already holding ac->alloc_semp */ ->ext4_mb_mark_diskspace_used ->vfs_dq_claim_space() /* acquire dqptr_sem here. Possible deadlock */ ->ext4_mb_release_context() /* drop ac->alloc_semp here */ Let's move quota claiming to ext4_da_update_reserve_space() ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.32-rc7 #18 ------------------------------------------------------- write-truncate-/3465 is trying to acquire lock: (&s->s_dquot.dqptr_sem){++++..}, at: [<c025e73b>] dquot_claim_space+0x3b/0x1b0 but task is already holding lock: (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&meta_group_info[i]->alloc_sem){++++..}: [<c017d04b>] __lock_acquire+0xd7b/0x1260 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c0527191>] down_read+0x51/0x90 [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370 [<c02d0c1c>] ext4_mb_free_blocks+0x46c/0x870 [<c029c9d3>] ext4_free_blocks+0x73/0x130 [<c02c8cfc>] ext4_ext_truncate+0x76c/0x8d0 [<c02a8087>] ext4_truncate+0x187/0x5e0 [<c01e0f7b>] vmtruncate+0x6b/0x70 [<c022ec02>] inode_setattr+0x62/0x190 [<c02a2d7a>] ext4_setattr+0x25a/0x370 [<c022ee81>] notify_change+0x151/0x340 [<c021349d>] do_truncate+0x6d/0xa0 [<c0221034>] may_open+0x1d4/0x200 [<c022412b>] do_filp_open+0x1eb/0x910 [<c021244d>] do_sys_open+0x6d/0x140 [<c021258e>] sys_open+0x2e/0x40 [<c0103100>] sysenter_do_call+0x12/0x32 -> #2 (&ei->i_data_sem){++++..}: [<c017d04b>] __lock_acquire+0xd7b/0x1260 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c0527191>] down_read+0x51/0x90 [<c02a5787>] ext4_get_blocks+0x47/0x450 [<c02a74c1>] ext4_getblk+0x61/0x1d0 [<c02a7a7f>] ext4_bread+0x1f/0xa0 [<c02bcddc>] ext4_quota_write+0x12c/0x310 [<c0262d23>] qtree_write_dquot+0x93/0x120 [<c0261708>] v2_write_dquot+0x28/0x30 [<c025d3fb>] dquot_commit+0xab/0xf0 [<c02be977>] ext4_write_dquot+0x77/0x90 [<c02be9bf>] ext4_mark_dquot_dirty+0x2f/0x50 [<c025e321>] dquot_alloc_inode+0x101/0x180 [<c029fec2>] ext4_new_inode+0x602/0xf00 [<c02ad789>] ext4_create+0x89/0x150 [<c0221ff2>] vfs_create+0xa2/0xc0 [<c02246e7>] do_filp_open+0x7a7/0x910 [<c021244d>] do_sys_open+0x6d/0x140 [<c021258e>] sys_open+0x2e/0x40 [<c0103100>] sysenter_do_call+0x12/0x32 -> #1 (&sb->s_type->i_mutex_key#7/4){+.+...}: [<c017d04b>] __lock_acquire+0xd7b/0x1260 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c0526505>] mutex_lock_nested+0x65/0x2d0 [<c0260c9d>] vfs_load_quota_inode+0x4bd/0x5a0 [<c02610af>] vfs_quota_on_path+0x5f/0x70 [<c02bc812>] ext4_quota_on+0x112/0x190 [<c026345a>] sys_quotactl+0x44a/0x8a0 [<c0103100>] sysenter_do_call+0x12/0x32 -> #0 (&s->s_dquot.dqptr_sem){++++..}: [<c017d361>] __lock_acquire+0x1091/0x1260 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c0527191>] down_read+0x51/0x90 [<c025e73b>] dquot_claim_space+0x3b/0x1b0 [<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380 [<c02d210a>] ext4_mb_new_blocks+0x34a/0x530 [<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0 [<c02a5966>] ext4_get_blocks+0x226/0x450 [<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0 [<c02a6ed6>] ext4_da_writepages+0x506/0x790 [<c01de272>] do_writepages+0x22/0x50 [<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80 [<c01d7b9b>] filemap_flush+0x2b/0x30 [<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60 [<c029e595>] ext4_release_file+0x75/0xb0 [<c0216b59>] __fput+0xf9/0x210 [<c0216c97>] fput+0x27/0x30 [<c02122dc>] filp_close+0x4c/0x80 [<c014510e>] put_files_struct+0x6e/0xd0 [<c01451b7>] exit_files+0x47/0x60 [<c0146a24>] do_exit+0x144/0x710 [<c0147028>] do_group_exit+0x38/0xa0 [<c0159abc>] get_signal_to_deliver+0x2ac/0x410 [<c0102849>] do_notify_resume+0xb9/0x890 [<c01032d2>] work_notifysig+0x13/0x21 other info that might help us debug this: 3 locks held by write-truncate-/3465: #0: (jbd2_handle){+.+...}, at: [<c02e1f8f>] start_this_handle+0x38f/0x5c0 #1: (&ei->i_data_sem){++++..}, at: [<c02a57f6>] ext4_get_blocks+0xb6/0x450 #2: (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370 stack backtrace: Pid: 3465, comm: write-truncate- Not tainted 2.6.32-rc7 #18 Call Trace: [<c0524cb3>] ? printk+0x1d/0x22 [<c017ac9a>] print_circular_bug+0xca/0xd0 [<c017d361>] __lock_acquire+0x1091/0x1260 [<c016bca2>] ? sched_clock_local+0xd2/0x170 [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0 [<c017d5ea>] lock_acquire+0xba/0xd0 [<c025e73b>] ? dquot_claim_space+0x3b/0x1b0 [<c0527191>] down_read+0x51/0x90 [<c025e73b>] ? dquot_claim_space+0x3b/0x1b0 [<c025e73b>] dquot_claim_space+0x3b/0x1b0 [<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380 [<c02d210a>] ext4_mb_new_blocks+0x34a/0x530 [<c02c601d>] ? ext4_ext_find_extent+0x25d/0x280 [<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0 [<c016bca2>] ? sched_clock_local+0xd2/0x170 [<c016be60>] ? sched_clock_cpu+0x120/0x160 [<c016beef>] ? cpu_clock+0x4f/0x60 [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0 [<c052712c>] ? down_write+0x8c/0xa0 [<c02a5966>] ext4_get_blocks+0x226/0x450 [<c016be60>] ? sched_clock_cpu+0x120/0x160 [<c016beef>] ? cpu_clock+0x4f/0x60 [<c017908b>] ? trace_hardirqs_off+0xb/0x10 [<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0 [<c01d69cc>] ? find_get_pages_tag+0x16c/0x180 [<c01d6860>] ? find_get_pages_tag+0x0/0x180 [<c02a73bd>] ? __mpage_da_writepage+0x16d/0x1a0 [<c01dfc4e>] ? pagevec_lookup_tag+0x2e/0x40 [<c01ddf1b>] ? write_cache_pages+0xdb/0x3d0 [<c02a7250>] ? __mpage_da_writepage+0x0/0x1a0 [<c02a6ed6>] ext4_da_writepages+0x506/0x790 [<c016beef>] ? cpu_clock+0x4f/0x60 [<c016bca2>] ? sched_clock_local+0xd2/0x170 [<c016be60>] ? sched_clock_cpu+0x120/0x160 [<c016be60>] ? sched_clock_cpu+0x120/0x160 [<c02a69d0>] ? ext4_da_writepages+0x0/0x790 [<c01de272>] do_writepages+0x22/0x50 [<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80 [<c01d7b9b>] filemap_flush+0x2b/0x30 [<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60 [<c029e595>] ext4_release_file+0x75/0xb0 [<c0216b59>] __fput+0xf9/0x210 [<c0216c97>] fput+0x27/0x30 [<c02122dc>] filp_close+0x4c/0x80 [<c014510e>] put_files_struct+0x6e/0xd0 [<c01451b7>] exit_files+0x47/0x60 [<c0146a24>] do_exit+0x144/0x710 [<c017b163>] ? lock_release_holdtime+0x33/0x210 [<c0528137>] ? _spin_unlock_irq+0x27/0x30 [<c0147028>] do_group_exit+0x38/0xa0 [<c017babb>] ? trace_hardirqs_on+0xb/0x10 [<c0159abc>] get_signal_to_deliver+0x2ac/0x410 [<c0102849>] do_notify_resume+0xb9/0x890 [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0 [<c017b163>] ? lock_release_holdtime+0x33/0x210 [<c0165b50>] ? autoremove_wake_function+0x0/0x50 [<c017ba54>] ? trace_hardirqs_on_caller+0x134/0x190 [<c017babb>] ? trace_hardirqs_on+0xb/0x10 [<c0300ba4>] ? security_file_permission+0x14/0x20 [<c0215761>] ? vfs_write+0x131/0x190 [<c0214f50>] ? do_sync_write+0x0/0x120 [<c0103115>] ? sysenter_do_call+0x27/0x32 [<c01032d2>] work_notifysig+0x13/0x21 CC: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
1f2acb60 |
|
22-Jan-2010 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Add block validity check when truncating indirect block mapped inodes Add checks to ext4_free_branches() to make sure a block number found in an indirect block are valid before trying to free it. If a bad block number is found, stop freeing the indirect block immediately, since the file system is corrupt and we will need to run fsck anyway. This also avoids spamming the logs, and specifically avoids driver-level "attempt to access beyond end of device" errors obscure what is really going on. If you get *really*, *really*, *really* unlucky, without this patch, a supposed indirect block containing garbage might contain a reference to a primary block group descriptor, in which case ext4_free_branches() could end up zero'ing out a block group descriptor block, and if then one of the block bitmaps for a block group described by that bg descriptor block is not in memory, and is read in by ext4_read_block_bitmap(). This function calls ext4_valid_block_bitmap(), which assumes that bg_inode_table() was validated at mount time and hasn't been modified since. Since this assumption is no longer valid, it's possible for the value (ext4_inode_table(sb, desc) - group_first_block) to go negative, which will cause ext4_find_next_zero_bit() to trigger a kernel GPF. Addresses-Google-Bug: #2220436 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
b844167e |
|
08-Dec-2009 |
Curt Wohlgemuth <curtw@google.com> |
ext4: remove blocks from inode prealloc list on failure This fixes a leak of blocks in an inode prealloc list if device failures cause ext4_mb_mark_diskspace_used() to fail. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
af901ca1 |
|
14-Nov-2009 |
André Goddard Rosa <andre.goddard@gmail.com> |
tree-wide: fix assorted typos all over the place That is "success", "unknown", "through", "performance", "[re|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
9084d471 |
|
22-Nov-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: use ext4_data_block_valid() in ext4_free_blocks() The block validity framework does a more comprehensive set of checks, and it saves object code space to use the ext4_data_block_valid() than the limited open-coded version that had been in ext4_free_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e6362609 |
|
23-Nov-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: call ext4_forget() from ext4_free_blocks() Add the facility for ext4_forget() to be called from ext4_free_blocks(). This simplifies the code in a large number of places, and centralizes most of the work of calling ext4_forget() into a single place. Also fix a bug in the extents migration code; it wasn't calling ext4_forget() when releasing the indirect blocks during the conversion. As a result, if the system cashed during or shortly after the extents migration, and the released indirect blocks get reused as data blocks, the journal replay would corrupt the data blocks. With this new patch, fixing this bug was as simple as adding the EXT4_FREE_BLOCKS_FORGET flags to the call to ext4_free_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
|
#
44338711 |
|
22-Nov-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: fold ext4_free_blocks() and ext4_mb_free_blocks() ext4_mb_free_blocks() is only called by ext4_free_blocks(), and the latter function doesn't really do much. So merge the two functions together, such that ext4_free_blocks() is now found in fs/ext4/mballoc.c. This saves about 200 bytes of compiled text space. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5328e635 |
|
19-Nov-2009 |
Eric Sandeen <sandeen@redhat.com> |
ext4: make trim/discard optional (and off by default) It is anticipated that when sb_issue_discard starts doing real work on trim-capable devices, we may see issues. Make this mount-time optional, and default it to off until we know that things are working out OK. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
ca0c9584 |
|
03-Oct-2009 |
Christoph Lameter <cl@linux-foundation.org> |
this_cpu: Straight transformations Use this_cpu_ptr and __this_cpu_ptr in locations where straight transformations are possible because per_cpu_ptr is used with either smp_processor_id() or raw_smp_processor_id(). cc: David Howells <dhowells@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> cc: Ingo Molnar <mingo@elte.hu> cc: Rusty Russell <rusty@rustcorp.com.au> cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
|
#
296c355c |
|
29-Sep-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Use tracepoints for mb_history trace file The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a number of problems: it required a largish amount of memory to be allocated for each ext4 filesystem, and the s_mb_history_lock introduced a CPU contention problem. By ripping out the mb_history code and replacing it with ftrace tracepoints, and we get more functionality: timestamps, event filtering, the ability to correlate mballoc history with other ext4 tracepoints, etc. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
90576c0b |
|
29-Sep-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4, jbd2: Drop unneeded printks at mount and unmount time There are a number of kernel printk's which are printed when an ext4 filesystem is mounted and unmounted. Disable them to economize space in the system logs. In addition, disabling the mballoc stats by default saves a number of unneeded atomic operations for every block allocation or deallocation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
71780577 |
|
27-Sep-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Fix hueristic which avoids group preallocation for closed files The hueristic was designed to avoid using locality group preallocation when writing the last segment of a closed file. Fix it by move setting size to the maximum of size and isize until after we check whether size == isize. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
fb0a387d |
|
16-Sep-2009 |
Eric Sandeen <sandeen@redhat.com> |
ext4: limit block allocations for indirect-block files to < 2^32 Today, the ext4 allocator will happily allocate blocks past 2^32 for indirect-block files, which results in the block numbers getting truncated, and corruption ensues. This patch limits such allocations to < 2^32, and adds BUG_ONs if we do get blocks larger than that. This should address RH Bug 519471, ext4 bitmap allocator must limit blocks to < 2^32 * ext4_find_goal() is modified to choose a goal < UINT_MAX, so that our starting point is in an acceptable range. * ext4_xattr_block_set() is modified such that the goal block is < UINT_MAX, as above. * ext4_mb_regular_allocator() is modified so that the group search does not continue into groups which are too high * ext4_mb_use_preallocated() has a check that we don't use preallocated space which is too far out * ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs No attempt has been made to limit inode locations to < 2^32, so we may wind up with blocks far from their inodes. Doing this much already will lead to some odd ENOSPC issues when the "lower 32" gets full, and further restricting inodes could make that even weirder. For high inodes, choosing a goal of the original, % UINT_MAX, may be a bit odd, but then we're in an odd situation anyway, and I don't know of a better heuristic. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
08c3a813 |
|
09-Sep-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Clarify the locking details in mballoc We don't need to take the alloc_sem lock when we are adding new groups, since mballoc won't see the new group added until we bump sbi->s_groups_count. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
#
f41c0750 |
|
09-Sep-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: check for need init flag in ext4_mb_load_buddy We should check for need init flag with the group's alloc_sem held, to make sure while we are loading the buddy cache and holding a reference to it, a file system resize can't add new blocks to same group. The patch also drops the need init flag check in ext4_mb_regular_allocator() because doing the check without holding alloc_sem is racy. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
#
b6a758ec |
|
09-Sep-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: move ext4_mb_init_group() function earlier in the mballoc.c This moves the function around so that it can be called from ext4_mb_load_buddy(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7f1346a9 |
|
05-Sep-2009 |
Tobias Klauser <tklauser@distanz.ch> |
ext4: Declare seq_operations and file_operations structures as const Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a36b4498 |
|
25-Aug-2009 |
Eric Sandeen <sandeen@redhat.com> |
ext4: use ext4_grpblk_t more extensively unsigned short is potentially too small to track blocks within a group; today it is safe due to restrictions in e2fsprogs but we have _lo / _hi bits for group blocks with the intent to go up to 32 bits, so clean this up now. There are many more places where we use unsigned/int/unsigned int to contain a group block but this should at least fix all the short types. I added a few comments to the struct ext4_group_info definition as well. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
1927805e |
|
25-Aug-2009 |
Eric Sandeen <sandeen@redhat.com> |
ext4: use variables not types in sizeofs() for allocations Precursor to changing some types; to keep things in sync, it seems better to allocate/memset based on the size of the variables we are using rather than on some disconnected basic type like "unsigned short" Signed-off-by: Eric Sandeen <sandeen@redhat.com>
|
#
38877f4e |
|
17-Aug-2009 |
Eric Sandeen <sandeen@redhat.com> |
simplify some logic in ext4_mb_normalize_request While reading through some of the mballoc code it seems that a couple spots in the size normalization function could be streamlined. The test for non-overlapping PAs can be or'd for the start & end conditions, and the tests for adjacent PAs can be else-if'd - it's essentially independently testing: if (A + B <= C) ... if (A > C) ... These cannot both be true so it seems like the else-if might be slightly more efficient and/or informative. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0373130d |
|
17-Aug-2009 |
Eric Sandeen <sandeen@redhat.com> |
ext4: open-code ext4_mb_update_group_info ext4_mb_update_group_info is only called in one place, and it's extremely simple. There's no reason to have it in a separate function in a separate file as far as I can tell, it just obfuscates what's really going on. Perhaps it was intended to keep the grp->bb_* manipulation local to mballoc.c but we're already accessing other grp-> fields in balloc.c directly so this seems ok. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
50797481 |
|
18-Sep-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Avoid group preallocation for closed files Currently the group preallocation code tries to find a large (512) free block from which to do per-cpu group allocation for small files. The problem with this scheme is that it leaves the filesystem horribly fragmented. In the worst case, if the filesystem is unmounted and remounted (after a system shutdown, for example) we forget the fact that wee were using a particular (now-partially filled) 512 block extent. So the next time we try to allocate space for a small file, we will find *another* completely free 512 block chunk to allocate small files. Given that there are 32,768 blocks in a block group, after 64 iterations of "mount, write one 4k file in a directory, unmount", the block group will have 64 files, each separated by 511 blocks, and the block group will no longer have any free 512 completely free chunks of blocks for group preallocation space. So if we try to allocate blocks for a file that has been closed, such that we know the final size of the file, and the filesystem is not busy, avoid using group preallocation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
4ba74d00 |
|
09-Aug-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Fix bugs in mballoc's stream allocation mode The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all screwed up. These fields were getting unconditionally all the time, set even when stream allocation had not taken place, and if they were being used when the file was smaller than s_mb_stream_request, which is when the allocation should _not_ be doing stream allocation. Fix this by determining whether or not we stream allocation should take place once, in ext4_mb_group_or_file(), and setting a flag which gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found(). This simplifies the code and assures that we are consistently using (or not using) the stream allocation logic. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0ef90db9 |
|
09-Aug-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Display the mballoc flags in mb_history in hex instead of decimal Displaying the flags in base 16 makes it easier to see which flags have been set. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
6ba495e9 |
|
18-Sep-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Add configurable run-time mballoc debugging Allow mballoc debugging to be enabled at run-time instead of just at compile time. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
833576b3 |
|
13-Jul-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Fix ext4_mb_initialize_context() to initialize all fields Pavel Roskin pointed out that kmemcheck indicated that ext4_mb_store_history() was accessing uninitialized values of ac->ac_tail and ac->ac_buddy leading to garbage in the mballoc history. Fix this by initializing the entire structure to all zeros first. Also, two fields were getting doubly initialized by the caller of ext4_mb_initialize_context, so remove them for efficiency's sake. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
1c718505 |
|
05-Jul-2009 |
Akira Fujita <a-fujita@rs.jp.nec.com> |
ext4: Fix compile warnings with MB_DEBUG When MB_DEBUG is enabled, we get some compile warnings because ext4_group_t is unsigned int. This patch fixes them. Signed-off-by Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5a4a7989 |
|
05-Jul-2009 |
Joe Perches <joe@perches.com> |
ext4: Remove unnecessary semicolons in mballoc.c Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
024eab4d |
|
17-Jul-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Fix memory leak fix when mounting an ext4 filesystem The allocation of the ext4_group_info array was moved to a new function ext4_mb_add_group_info() in commit 5f21b0e6 so that online resize would use a common (and correct) codepath. Unfortunately, the call to the new ext4_mb_add_group_info() function was added without removing the code which originally allocated the array. This caused a memory leak each time an ext4 filesystem was mounted. The fix is simple; remove the code that did the original allocation, since it is no longer needed. Reported-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0610b6e9 |
|
15-Jun-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Fix 64-bit block type problem on 32-bit platforms The function ext4_mb_free_blocks() was using an "unsigned long" to pass a block number; this will cause 64-bit block numbers to get truncated on x86 and other 32-bit platforms. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
#
3e03f9ca |
|
05-Jul-2009 |
Jesper Dangaard Brouer <hawk@comx.dk> |
ext4: Use rcu_barrier() on module unload. The ext4 module uses rcu_call() thus it should use rcu_barrier()on module unload. The kmem cache ext4_pspace_cachep is sometimes free'ed using call_rcu() callbacks. Thus, we must wait for completion of call_rcu() before doing kmem_cache_destroy(). Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
089ceecc |
|
05-Jul-2009 |
Eric Sandeen <sandeen@redhat.com> |
ext4: mark several more functions in mballoc.c as noinline Ted noticed a stack-deep callchain through writepages->ext4_mb_regular_allocator->ext4_mb_init_cache->submit_bh ... With all the static functions in mballoc.c, gcc helpfully inlines for us, and we get something like this: ext4_mb_regular_allocator (232 bytes stack) ext4_mb_init_cache (232 bytes stack) submit_bh (starts 464 deeper) the 2 ext4 functions here get several others inlined; by telling gcc not to inline them, we can save stack space for when we head off into submit_bh land and associated block layer callchains. The following noinlined functions are only called once, so this won't impact any other callchains: ext4_mb_regular_allocator (104) (was 232) ext4_mb_find_by_goal (56) (noinlined) ext4_mb_init_group (24) (noinlined) ext4_mb_init_cache (136) (was 232) ext4_mb_generate_buddy (88) (noinlined) ext4_mb_generate_from_pa (40) (noinlined) submit_bh ext4_mb_simple_scan_group (24) (noinlined) ext4_mb_scan_aligned (56) (noinlined) ext4_mb_complex_scan_group (40) (noinlined) ext4_mb_try_best_found (24) (noinlined) now when we head off into submit_bh() we're only 264 bytes deeper in stack than when we entered ext4_mb_regular_allocator() (vs. 464 bytes before). Every 200 bytes helps. :) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
6fd058f7 |
|
17-May-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Add a comprehensive block validity check to ext4_get_blocks() To catch filesystem bugs or corruption which could lead to the filesystem getting severly damaged, this patch adds a facility for tracking all of the filesystem metadata blocks by contiguous regions in a red-black tree. This allows quick searching of the tree to locate extents which might overlap with filesystem metadata blocks. This facility is also used by the multi-block allocator to assure that it is not allocating blocks out of the system zone, as well as by the routines used when reading indirect blocks and extents information from disk to make sure their contents are valid. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
bc8e6740 |
|
15-May-2009 |
Vincent Minet <vincent@vincent-minet.net> |
ext4: Fix spinlock assertions on UP systems On UP systems without DEBUG_SPINLOCK, ext4_is_group_locked always fails which triggers a BUG_ON() call. This patch fixes it by using assert_spin_locked instead. Signed-off-by: Vincent Minet <vincent@vincent-minet.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
955ce5f5 |
|
02-May-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Convert ext4_lock_group to use sb_bgl_lock We have sb_bgl_lock() and ext4_group_info.bb_state bit spinlock to protech group information. The later is only used within mballoc code. Consolidate them to use sb_bgl_lock(). This makes the mballoc.c code much simpler and also avoid confusion with two locks protecting same info. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
f4033903 |
|
01-May-2009 |
Curt Wohlgemuth <curtw@google.com> |
ext4: Make the length of the mb_history file tunable In memory-constrained systems with many partitions, the ~68K for each partition for the mb_history buffer can be excessive. This patch adds a new mount option, mb_history_length, as well as a way of setting the default via a module parameter (or via a sysfs parameter in /sys/module/ext4/parameter/default_mb_history_length). If the mb_history_length is set to zero, the mb_history facility is disabled entirely. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
75507efb |
|
30-Apr-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Don't avoid using BLOCK_UNINIT block groups in mballoc By avoiding the use of not-yet-used block groups (i.e., block groups with the BLOCK_UNINIT flag), mballoc had a tendency to create large files with large non-contiguous gaps. In addition avoiding the use of new block groups had a tendency to push regular file data into the first block group in a flex_bg group, which slows down the speed of e2fsck pass 2, since it has a tendency to seek much more. For example: Before Patch After Patch Time in seconds Time in seconds Real / User/ Sys MB/s Real / User/ Sys MB/s Pass 1 8.52 / 2.21 / 0.46 20.43 8.84 / 4.97 / 1.11 19.68 Pass 2 21.16 / 1.02 / 1.86 11.30 6.54 / 1.77 / 1.78 36.39 Pass 3 0.01 / 0.00 / 0.00 139.00 0.01 / 0.01 / 0.00 128.90 Pass 4 0.16 / 0.15 / 0.00 0.00 0.17 / 0.17 / 0.00 0.00 Pass 5 2.52 / 1.99 / 0.09 0.79 2.31 / 1.78 / 0.06 0.86 Total 32.40 / 5.11 / 2.49 12.81 17.99 / 8.75 / 2.98 23.01 This was on a sample 80 gig root filesystem which was approximately 50% full. Note the improved e2fsck pass 2 performance, by over a factor of 3, due to a decreased number of seeks. (The total amount of I/O in pass 2 was unchanged; the layout of the directory blocks was simply much better from e2fsck's's perspective.) Other changes as a result of this patch on this sample filesystem: Before Patch After Patch # of non-contig files 762 779 # of non-contig directories 571 570 # of BLOCK_UNINIT bg's 307 293 # of INODE_UNINIT bg's 503 503 Out of 640 block groups, of which 333 were in use, this patch caused an extra 14 block groups to be utilized. The number of non-contiguous files did go up slightly, but when measured against the 99.9% of the files (603,154) which were contiguously allocated, this is pretty insignificant. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andreas Dilger <adilger@sun.com>
|
#
9bffad1e |
|
17-Jun-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: convert instrumentation from markers to tracepoints Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
8df9675f |
|
01-May-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Avoid races caused by on-line resizing and SMP memory reordering Ext4's on-line resizing adds a new block group and then, only at the last step adjusts s_groups_count. However, it's possible on SMP systems that another CPU could see the updated the s_group_count and not see the newly initialized data structures for the just-added block group. For this reason, it's important to insert a SMP read barrier after reading s_groups_count and before reading any (for example) the new block group descriptors allowed by the increased value of s_groups_count. Unfortunately, we rather blatently violate this locking protocol documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes happen relatively rarely, and (2) it seems rare that the filesystem code will immediately try to use just-added block group before any memory ordering issues resolve themselves. So apparently problems here are relatively hard to hit, since ext3 has been vulnerable to the same issue for years with no one apparently complaining. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e7c9e3e9 |
|
27-Mar-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: fix locking typo in mballoc which could cause soft lockup hangs Smatch (http://repo.or.cz/w/smatch.git/) complains about the locking in ext4_mb_add_n_trim() from fs/ext4/mballoc.c 4438 list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order], 4439 pa_inode_list) { 4440 spin_lock(&tmp_pa->pa_lock); 4441 if (tmp_pa->pa_deleted) { 4442 spin_unlock(&pa->pa_lock); 4443 continue; 4444 } Brown paper bag time... Reported-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
a7b19448 |
|
27-Mar-2009 |
Dan Carpenter <error27@gmail.com> |
ext4: fix typo which causes a memory leak on error path This was found by smatch (http://repo.or.cz/w/smatch.git/) Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
cc0fb9ad |
|
27-Mar-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Rename pa_linear to pa_type Impact: code cleanup This patch rename pa_linear to pa_type and add MB_INODE_PA and MB_GROUP_PA to indicate inode and group prealloc space. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a269eb18 |
|
26-Jan-2009 |
Jan Kara <jack@suse.cz> |
ext4: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mingming Cao <cmm@us.ibm.com> CC: linux-ext4@vger.kernel.org
|
#
60e58e0f |
|
22-Jan-2009 |
Mingming Cao <cmm@us.ibm.com> |
ext4: quota reservation for delayed allocation Uses quota reservation/claim/release to handle quota properly for delayed allocation in the three steps: 1) quotas are reserved when data being copied to cache when block allocation is defered 2) when new blocks are allocated. reserved quotas are converted to the real allocated quota, 2) over-booked quotas for metadata blocks are released back. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
d33a1976 |
|
16-Mar-2009 |
Eric Sandeen <sandeen@redhat.com> |
ext4: fix bb_prealloc_list corruption due to wrong group locking This is for Red Hat bug 490026: EXT4 panic, list corruption in ext4_mb_new_inode_pa ext4_lock_group(sb, group) is supposed to protect this list for each group, and a common code flow to remove an album is like this: ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL); ext4_lock_group(sb, grp); list_del(&pa->pa_group_list); ext4_unlock_group(sb, grp); so it's critical that we get the right group number back for this prealloc context, to lock the right group (the one associated with this pa) and prevent concurrent list manipulation. however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a comment, "-1 is to protect from crossing allocation group". This makes sense for the group_pa, where pa_pstart is advanced by the length which has been used (in ext4_mb_release_context()), and when the entire length has been used, pa_pstart has been advanced to the first block of the next group. However, for inode_pa, pa_pstart is never advanced; it's just set once to the first block in the group and not moved after that. So in this case, if we subtract one in ext4_mb_put_pa(), we are actually locking the *previous* group, and opening the race with the other threads which do not subtract off the extra block. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
8d03c7a0 |
|
14-Mar-2009 |
Eric Sandeen <sandeen@redhat.com> |
ext4: fix bogus BUG_ONs in in mballoc code Thiemo Nagel reported that: # dd if=/dev/zero of=image.ext4 bs=1M count=2 # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \ -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4 # mount -o loop image.ext4 mnt/ # dd if=/dev/zero of=mnt/file oopsed, with a BUG_ON in ext4_mb_normalize_request because size == EXT4_BLOCKS_PER_GROUP It appears to me (esp. after talking to Andreas) that the BUG_ON is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should be allowed, though larger sizes do indicate a problem. Fix that an another (apparently rare) codepath with a similar check. Reported-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
9f24e420 |
|
04-Mar-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Use atomic_t's in struct flex_groups Reduce pressure on the sb_bgl_lock family of locks by using atomic_t's to track the number of free blocks and inodes in each flex_group. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
b713a5ec |
|
31-Mar-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: remove /proc tuning knobs Remove tuning knobs in /proc/fs/ext4/<dev/* since they have been replaced by knobs in sysfs at /sys/fs/ext4/<dev>/*. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a4912123 |
|
11-Mar-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: New inode/block allocation algorithms for flex_bg filesystems The find_group_flex() inode allocator is now only used if the filesystem is mounted using the "oldalloc" mount option. It is replaced with the original Orlov allocator that has been updated for flex_bg filesystems (it should behave the same way if flex_bg is disabled). The inode allocator now functions by taking into account each flex_bg group, instead of each block group, when deciding whether or not it's time to allocate a new directory into a fresh flex_bg. The block allocator has also been changed so that the first block group in each flex_bg is preferred for use for storing directory blocks. This keeps directory blocks close together, which is good for speeding up e2fsck since large directories are more likely to look like this: debugfs: stat /home/tytso/Maildir/cur Inode: 1844562 Type: directory Mode: 0700 Flags: 0x81000 Generation: 1132745781 Version: 0x00000000:0000ad71 User: 15806 Group: 15806 Size: 1060864 File ACL: 0 Directory ACL: 0 Links: 2 Blockcount: 2072 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009 atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009 mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009 crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009 Size of extra inode fields: 28 BLOCKS: (0):7348651, (1-258):7348654-7348911 TOTAL: 259 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
d794bf8e |
|
14-Feb-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Initialize preallocation list_head's properly When creating a new ext4_prealloc_space structure, we have to initialize its list_head pointers before we add them to any prealloc lists. Otherwise, with list debug enabled, we will get list corruption warnings. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
ba443916 |
|
10-Feb-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Fix lockdep warning We should not call ext4_mb_add_n_trim while holding alloc_semp. ============================================= [ INFO: possible recursive locking detected ] 2.6.29-rc4-git1-dirty #124 --------------------------------------------- ffsb/3116 is trying to acquire lock: (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>] ext4_mb_load_buddy+0xd2/0x343 but task is already holding lock: (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>] ext4_mb_load_buddy+0xd2/0x343 http://bugzilla.kernel.org/show_bug.cgi?id=12672 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
9fd9784c |
|
26-Jan-2009 |
Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> |
ext4: Fix building with EXT4FS_DEBUG When bg_free_blocks_count was renamed to bg_free_blocks_count_lo in 560671a0, its uses under EXT4FS_DEBUG were not changed to the helper ext4_free_blks_count. Another commit, 498e5f24, also did not change everything needed under EXT4FS_DEBUG, thus making it spill some warnings related to printing format. This commit fixes both issues and makes ext4 build again when EXT4FS_DEBUG is enabled. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
ba80b101 |
|
03-Jan-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Add markers for better debuggability Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0087d9fb |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc With nodelalloc option we need to update the dirty block counter on block allocation failure. This is needed because we increment the dirty block counter early in the block allocation phase. Without the patch s_dirty_blocks_counter goes wrong so that filesystem's free blocks decreases incorrectly. Tested-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
29eaf024 |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Init the complete page while building buddy cache We need to init the complete page during buddy cache init by setting the contents to '1'. Otherwise we can see the following errors after doing an online resize of the filesystem: EXT4-fs error (device sdb1): ext4_mb_mark_diskspace_used: Allocating block 1040385 in system zone of 127 group Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
8556e8f3 |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Don't allow new groups to be added during block allocation After we mark the blocks in the buddy cache as allocated, we need to ensure that we don't reinit the buddy cache until the block bitmap is updated. This commit achieves this by holding the group_info alloc_semaphore till ext4_mb_release_context Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
648f5879 |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: mark the blocks/inode bitmap beyond end of group as used We need to mark the block/inode bitmap beyond the end of the group with '1'. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
2ccb5fb9 |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Use new buffer_head flag to check uninit group bitmaps initialization For uninit block group, the on-disk bitmap is not initialized. That implies we cannot depend on the uptodate flag on the bitmap buffer_head to find bitmap validity. Use a new buffer_head flag which would be set after we properly initialize the bitmap. This also prevents (re-)initializing the uninit group bitmap every time we call ext4_read_block_bitmap(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
3300beda |
|
03-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: code cleanup Rename some variables. We also unlock locks in the reverse order we acquired as a part of cleanup. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
560671a0 |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Use high 16 bits of the block group descriptor's free counts fields Rename the lower bits with suffix _lo and add helper to access the values. Also rename bg_itable_unused_hi to bg_pad as in e2fsprogs. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e8134b27 |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Fix race between read_block_bitmap() and mark_diskspace_used() We need to make sure we update the block bitmap and clear EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held, since ext4_read_block_bitmap() looks at EXT4_BG_BLOCK_UNINIT to decide whether to initialize the block bitmap each time it is called (introduced by commit c806e68f), and this can race with block allocations in ext4_mb_mark_diskspace_used(). ext4_read_block_bitmap does: spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group)); if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { ext4_init_block_bitmap(sb, bh, block_group, desc); Now on the block allocation side we do mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data, ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len); .... spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group)); if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT); ie on allocation we update the bitmap then we take the sb_bgl_lock and clear the EXT4_BG_BLOCK_UNINIT flag. What can happen is a parallel ext4_read_block_bitmap can zero out the bitmap in between the above mb_set_bits and spin_lock(sb_bg_lock..) The race results in below user visible errors EXT4-fs error (device sdb1): ext4_mb_release_inode_pa: free 100, pa_free 105 EXT4-fs error (device sdb1): mb_free_blocks: double-free of inode 0's block .. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
5d1b1b3f |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: fix BUG when calling ext4_error with locked block group The mballoc code likes to call ext4_error while it is holding locked block groups. This can causes a scheduling in atomic context BUG. We can't just unlock the block group and relock it after/if ext4_error returns since that might result in race conditions in the case where the filesystem is set to continue after finding errors. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
b7be019e |
|
23-Nov-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Fix lockdep recursive locking warning In ext4_mb_init_group(), if the filesystem block size is less than PAGE_SIZE/2, the code tries to grab alloc_sem for multiple block groups in a loop. We need to allow for this by using down_write_nested() and passing in the loop index as a lock subclass number. This works because no other code path needs to take multiple alloc_sem's. Note that lockdep will fail for filesystem blocksize smaller than to PAGE_SIZE/16k. (e.g., a 1k filesystem blocksize with a 32k page size, or a 2k filesystem blocksize with a 64k blocksize, etc.) Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7a2fcbf7 |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: don't use blocks freed but not yet committed in buddy cache init When we generate buddy cache (especially during resize) we need to make sure we don't use the blocks freed but not yet comitted. This makes sure we have the right value of free blocks count in the group info and also in the bitmap. This also ensures the ordered mode consistency Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
c3a326a6 |
|
25-Nov-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: cleanup mballoc header files Move some of the forward declaration of the static functions to mballoc.c where they are used. This enables us to include mballoc.h in other .c files. Also correct the buddy cache documentation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
920313a7 |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize The new groups added during resize are flagged as need_init group. Make sure we properly initialize these groups. When we have block size < page size and we are adding new groups the page may still be marked uptodate even though we haven't initialized the group. While forcing the init of buddy cache we need to make sure other groups part of the same page of buddy cache is not using the cache. group_info->alloc_sem is added to ensure the same. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> cc: stable@kernel.org
|
#
3a06d778 |
|
22-Nov-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: sparse fixes * Change EXT4_HAS_*_FEATURE to return a boolean * Add a function prototype for ext4_fiemap() in ext4.h * Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions * Add lock annotations to mb_free_blocks() Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
498e5f24 |
|
04-Nov-2008 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Change unsigned long to unsigned int Convert the unsigned longs that are most responsible for bloating the stack usage on 64-bit systems. Nearly all places in the ext3/4 code which uses "unsigned long" is probably a bug, since on 32-bit systems a ulong a 32-bits, which means we are wasting stack space on 64-bit systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a9df9a49 |
|
05-Jan-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Make ext4_group_t be an unsigned int Nearly all places in the ext3/4 code which uses "unsigned long" is probably a bug, since on 32-bit systems a ulong a 32-bits, which means we are wasting stack space on 64-bit systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
032115fc |
|
05-Jan-2009 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Don't overwrite allocation_context ac_status We can call ext4_mb_check_limits even after successfully allocating the requested blocks. In that case, make sure we don't overwrite ac_status if it already has the status AC_STATUS_FOUND. This fixes the lockdep warning: ============================================= [ INFO: possible recursive locking detected ] 2.6.28-rc6-autokern1 #1 --------------------------------------------- fsstress/11948 is trying to acquire lock: (&meta_group_info[i]->alloc_sem){----}, at: [<c04d9a49>] ext4_mb_load_buddy+0x9f/0x278 ..... stack backtrace: ..... [<c04db974>] ext4_mb_regular_allocator+0xbb5/0xd44 ..... but task is already holding lock: (&meta_group_info[i]->alloc_sem){----}, at: [<c04d9a49>] ext4_mb_load_buddy+0x9f/0x278 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
|
#
fde4d95a |
|
05-Jan-2009 |
Theodore Ts'o <tytso@mit.edu> |
ext4: remove extraneous newlines from calls to ext4_error() and ext4_warning() This removes annoying blank syslog entries emitted by ext4_error() or ext4_warning(), since these functions add their own newline. Signed-off-by: Nick Warne <nick@ukfsn.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0390131b |
|
06-Jan-2009 |
Frank Mayhar <fmayhar@google.com> |
ext4: Allow ext4 to run without a journal A few weeks ago I posted a patch for discussion that allowed ext4 to run without a journal. Since that time I've integrated the excellent comments from Andreas and fixed several serious bugs. We're currently running with this patch and generating some performance numbers against both ext2 (with backported reservations code) and ext4 with and without a journal. It just so happens that running without a journal is slightly faster for most everything. We did iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2 which creates 4 threads, each of which create and do reads and writes on a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens to bypass the page cache. Results: ext2 ext4, default ext4, no journal initial writes 13.0 MB/s 15.4 MB/s 15.7 MB/s rewrites 13.1 MB/s 15.6 MB/s 15.9 MB/s reads 15.2 MB/s 16.9 MB/s 17.2 MB/s re-reads 15.3 MB/s 16.9 MB/s 17.2 MB/s random readers 5.6 MB/s 5.6 MB/s 5.7 MB/s random writers 5.1 MB/s 5.3 MB/s 5.4 MB/s So it seems that, so far, this was a useful exercise. Signed-off-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
ff7ef329 |
|
16-Dec-2008 |
Yasunori Goto <y-goto@jp.fujitsu.com> |
ext4: Widen type of ext4_sb_info.s_mb_maxs[] I chased the cause of following ext4 oops report which is tested on ia64 box. http://bugzilla.kernel.org/show_bug.cgi?id=12018 The cause is the size of s_mb_maxs array that is defined as "unsigned short" in ext4_sb_info structure. If the file system's block size is 8k or greater, an unsigned short is not wide enough to contain the value fs->blocksize << 3. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: stable@kernel.org
|
#
ae2d9fb1 |
|
04-Nov-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: fix missing ext4_unlock_group in error path If we try to free a block which is already freed, the code was returning without first unlocking the group. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
3e624fc7 |
|
16-Oct-2008 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback The multiblock allocator needs to be able to release blocks (and issue a blkdev discard request) when the transaction which freed those blocks is committed. Previously this was done via a polling mechanism when blocks are allocated or freed. A much better way of doing things is to create a jbd2 callback function and attaching the list of blocks to be freed directly to the transaction structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
0b09923e |
|
17-Oct-2008 |
Manish Katiyar <mkatiyar@gmail.com> |
ext4: Remove compile warnings when building w/o CONFIG_PROC_FS Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
8a0aba73 |
|
16-Oct-2008 |
Theodore Ts'o <tytso@mit.edu> |
ext4: let the block device know when unused blocks can be discarded Let the block device know when unused blocks can be discarded, using the new sb_issue_discard() interface. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
c894058d |
|
16-Oct-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Use an rbtree for tracking blocks freed during transaction. With this patch we track the block freed during a transaction using red-black tree. We also make sure contiguous blocks freed are collected in one node in the tree. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
688f05a0 |
|
12-Oct-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Free ext4_prealloc_space using kmem_cache_free We should use kmem_cache_free to free memory allocated via kmem_cache_alloc Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
c806e68f |
|
10-Oct-2008 |
Frederic Bohe <frederic.bohe@bull.net> |
ext4: fix initialization of UNINIT bitmap blocks This fixes a bug which caused on-line resizing of filesystems with a 1k blocksize to fail. The root cause of this bug was the fact that if an uninitalized bitmap block gets read in by userspace (which e2fsprogs does try to avoid, but can happen when the blocksize is less than the pagesize and an adjacent blocks is read into memory) ext4_read_block_bitmap() was erroneously depending on the buffer uptodate flag to decide whether it needed to initialize the bitmap block in memory --- i.e., to set the standard set of blocks in use by a block group (superblock, bitmaps, inode table, etc.). Essentially, ext4_read_block_bitmap() assumed it was the only routine that might try to read a block containing a block bitmap, which is simply not true. To fix this, ext4_read_block_bitmap() and ext4_read_inode_bitmap() must always initialize uninitialized bitmap blocks. Once a block or inode is allocated out of that bitmap, it will be marked as initialized in the block group descriptor, so in general this won't result any extra unnecessary work. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
c2ea3fde |
|
10-Oct-2008 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Remove old legacy block allocator Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5e8814f2 |
|
23-Sep-2008 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Combine proc file handling into a single set of functions Previously mballoc created a separate set of functions for each proc file. This combines the tunables into a single set of functions which gets used for all of the per-superblock proc files, saving approximately 2k of compiled object code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
9f6200bb |
|
23-Sep-2008 |
Theodore Ts'o <tytso@mit.edu> |
ext4: move /proc setup and teardown out of mballoc.c ...and into the core setup/teardown code in fs/ext4/super.c so that other parts of ext4 can define tuning parameters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
730c213c |
|
13-Sep-2008 |
Eric Sandeen <sandeen@redhat.com> |
ext4: use percpu data structures for lg_prealloc_list lg_prealloc_list seems to cry out for a per-cpu data structure; on a large smp system I think this should be better. I've lightly tested this change on a 4-cpu system. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
899fc1a4 |
|
14-Sep-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
ext4: fix #11321: create /proc/ext4/*/stats more carefully ext4 creates per-suberblock directory in /proc/ext4/ . Name used as basis is taken from bdevname, which, surprise, can contain slash. However, proc while allowing to use proc_create("a/b", parent) form of PDE creation, assumes that parent/a was already created. bdevname in question is 'cciss/c0d0p9', directory is not created and all this stuff goes directly into /proc (which is real bug). Warning comes when _second_ partition is mounted. http://bugzilla.kernel.org/show_bug.cgi?id=11321 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
6bc6e63f |
|
10-Oct-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Add percpu dirty block accounting. This patch adds dirty block accounting using percpu_counters. Delayed allocation block reservation is now done by updating dirty block counter. In a later patch we switch to non delalloc mode if the filesystem free blocks is greater than 150% of total filesystem dirty blocks Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao<cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
030ba6bc |
|
08-Sep-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Retry block reservation During block reservation if we don't have enough blocks left, retry block reservation with smaller block counts. This makes sure we try fallocate and DIO with smaller request size and don't fail early. The delayed allocation reservation cannot try with smaller block count. So retry block reservation to handle temporary disk full conditions. Also print free blocks details if we fail block allocation during writepages. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
a30d542a |
|
09-Oct-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Make sure all the block allocation paths reserve blocks With delayed allocation we need to make sure block are reserved before we attempt to allocate them. Otherwise we get block allocation failure (ENOSPC) during writepages which cannot be handled. This would mean silent data loss (We do a printk stating data will be lost). This patch updates the DIO and fallocate code path to do block reservation before block allocation. This is needed to make sure parallel DIO and fallocate request doesn't take block out of delayed reserve space. When free blocks count go below a threshold we switch to a slow patch which looks at other CPU's accumulated percpu counter values. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
4776004f |
|
08-Sep-2008 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Add printk priority levels to clean up checkpatch warnings Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
5e745b04 |
|
18-Aug-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Fix small file fragmentation For small file block allocations, mballoc uses per cpu prealloc space. Use goal block when searching for the right prealloc space. Also make sure ext4_da_writepages tries to write all the pages for small files in single attempt Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
6be2ded1 |
|
23-Jul-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Don't allow lg prealloc list to be grow large. Currently, the locality group prealloc list is freed only when there is a block allocation failure. This can result in large number of entries in the preallocation list making ext4_mb_use_preallocated() expensive. To fix this, we convert the locality group prealloc list to a hash list. The hash index is the order of number of blocks in the prealloc space with a max order of 9. When adding prealloc space to the list we make sure total entries for each order does not exceed 8. If it is more than 8 we discard few entries and make sure the we have only <= 5 entries. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
1320cbcf |
|
23-Jul-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Convert the usage of NR_CPUS to nr_cpu_ids. NR_CPUS can be really large. We should be using nr_cpu_ids instead. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
ce89f46c |
|
23-Jul-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Improve error handling in mballoc Don't call BUG_ON on file system failures. Instead use ext4_error and also handle the continue case properly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
b5f10eed |
|
02-Aug-2008 |
Eric Sandeen <sandeen@redhat.com> |
ext4: lock block groups when initializing I noticed when filling a 1T filesystem with 4 threads using the fs_mark benchmark: fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 that I occasionally got checksum mismatch errors: EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935 etc. I'd reliably get 4-5 of them during the run. It appears that the problem is likely a race to init the bg's when the uninit_bg feature is enabled. With the patch below, which adds sb_bgl_locking around initialization, I was able to complete several runs with no errors or warnings. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
d2a17637 |
|
14-Jul-2008 |
Mingming Cao <cmm@us.ibm.com> |
ext4: delayed allocation ENOSPC handling This patch does block reservation for delayed allocation, to avoid ENOSPC later at page flush time. Blocks(data and metadata) are reserved at da_write_begin() time, the freeblocks counter is updated by then, and the number of reserved blocks is store in per inode counter. At the writepage time, the unused reserved meta blocks are returned back. At unlink/truncate time, reserved blocks are properly released. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix the oldallocator block reservation accounting with delalloc, added lock to guard the counters and also fix the reservation for meta blocks. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
5f21b0e6 |
|
11-Jul-2008 |
Frederic Bohe <frederic.bohe@bull.net> |
ext4: fix online resize with mballoc Update group infos when updating a group's descriptor. Add group infos when adding a group's descriptor. Refresh cache pages used by mb_alloc when changes occur. This will probably need modifications when META_BG resizing will be allowed. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
|
#
07031431 |
|
11-Jul-2008 |
Mingming Cao <cmm@us.ibm.com> |
ext4: mballoc avoid use root reserved blocks for non root allocation mballoc allocation missed check for blocks reserved for root users. Add ext4_has_free_blocks() check before allocation. Also modified ext4_has_free_blocks() to support multiple block allocation request. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
654b4908 |
|
11-Jul-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: cleanup block allocator Move the code related to block allocation to a single function and add helper funtions to differient allocation for data and meta data blocks Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
363d4251 |
|
11-Jul-2008 |
Shen Feng <shen@cn.fujitsu.com> |
ext4: remove quota allocation when ext4_mb_new_blocks fails Quota allocation is not removed when ext4_mb_new_blocks calls kmem_cache_alloc failed. Also make sure the allocation context is freed on the error path. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
772cb7c8 |
|
11-Jul-2008 |
Jose R. Santos <jrs@us.ibm.com> |
ext4: New inode allocation for FLEX_BG meta-data groups. This patch mostly controls the way inode are allocated in order to make ialloc aware of flex_bg block group grouping. It achieves this by bypassing the Orlov allocator when block group meta-data are packed toghether through mke2fs. Since the impact on the block allocator is minimal, this patch should have little or no effect on other block allocation algorithms. By controlling the inode allocation, it can basically control where the initial search for new block begins and thus indirectly manipulate the block allocator. This allocator favors data and meta-data locality so the disk will gradually be filled from block group zero upward. This helps improve performance by reducing seek time. Since the group of inode tables within one flex_bg are treated as one giant inode table, uninitialized block groups would not need to partially initialize as many inode table as with Orlov which would help fsck time as the filesystem usage goes up. Signed-off-by: Jose R. Santos <jrs@us.ibm.com> Signed-off-by: Valerie Clement <valerie.clement@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
7e5a8cdd |
|
13-Jul-2008 |
Shen Feng <shen@cn.fujitsu.com> |
ext4: fix error processing in mb_free_blocks The error processing of the return value of mb_free_blocks is meanless because it only returns 0. This fix includes - make mb_free_blocks return void - remove the error processing part in callers - unlock group before calling ext4_error in mb_free_blocks Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
cfbe7e4f |
|
13-Jul-2008 |
Shen Feng <shen@cn.fujitsu.com> |
ext4: error proc entry creation when the fs/ext4 is not correctly created When the directory fs/ext4 is not correctly created under proc, the entry under this directory should not be created. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
574ca174 |
|
11-Jul-2008 |
Theodore Ts'o <tytso@mit.edu> |
ext4: Rename read_block_bitmap() to ext4_read_block_bitmap() Since this a non-static function, make it be ext4 specific to avoid conflicts with potentially other filesystems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
74767c5a |
|
11-Jul-2008 |
Shen Feng <shen@cn.fujitsu.com> |
ext4: miscellaneous error checks and coding cleanups for mballoc ext4_mb_seq_history_open(): check if sbi->s_mb_history is NULL ext4_mb_history_init(): replace kmalloc and memset with kzalloc ext4_mb_init_backend(): remove memset since kzalloc is used ext4_mb_init(): the return value of ext4_mb_init_backend is int, but i is unsigned, replace it with a new int variable. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
fdf6c7a7 |
|
11-Jul-2008 |
Shen Feng <shen@cn.fujitsu.com> |
ext4: add error processing when calling ext4_mb_init_cache in mballoc Add error processing for ext4_mb_load_buddy when it calls ext4_mb_init_cache. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
31b481dc |
|
11-Jul-2008 |
Mingming Cao <cmm@us.ibm.com> |
ext4: Fix ext4_mb_init_cache return error ext4_mb_init_cache() incorrectly always return EIO on success. This causes the caller of ext4_mb_init_cache() fail when it checks the return value. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
91d99827 |
|
11-Jul-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
ext4: switch to seq_files Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
ed8f9c75 |
|
11-Jul-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: start searching for the right extent from the goal group. With mballoc we search for the best extent using different criteria. We should always use the goal group when we are starting with a new criteria. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e7dfb246 |
|
11-Jul-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Fix mb_find_next_bit not to return larger than max Some architectures implement ext4_find_next_bit and ext4_find_next_zero_bit in such a way that they return greater than max for some input values. Make sure mb_find_next_bit and mb_find_next_zero_bit return the right values. On 2.6.25 we have include/asm-x86/bitops_32.h static inline unsigned find_first_bit(const unsigned long *addr, unsigned size) { unsigned x = 0; while (x < size) { unsigned long val = *addr++; if (val) return __ffs(val) + x; x += (sizeof(*addr)<<3); } return x; } This can return value greater than size. Reported and fixed here for lustre https://bugzilla.lustre.org/show_bug.cgi?id=15932 https://bugzilla.lustre.org/attachment.cgi?id=17205 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
03cddb80 |
|
05-Jun-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Fix use of uninitialized data with debug enabled. Fix use of uninitialized data with debug enabled. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
519deca0 |
|
15-May-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Retry block allocation if new blocks are allocated from system zone. If the block allocator gets blocks out of system zone ext4 calls ext4_error. But if the file system is mounted with errors=continue retry block allocation. We need to mark the system zone blocks as in use to make sure retry don't pick them again System zone is the block range mapping block bitmap, inode bitmap and inode table. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
1930479c |
|
13-May-2008 |
Valerie Clement <valerie.clement@bull.net> |
ext4: mballoc fix mb_normalize_request algorithm for 1KB block size filesystems In case of inode preallocation, the number of blocks to allocate depends on the file size and it is calculated in ext4_mb_normalize_request(). Each group in the filesystem is then checked to find one that can be used for allocation; this is done in ext4_mb_good_group(). When a file bigger than 4MB is created, the requested number of blocks to preallocate, calculated by ext4_mb_normalize_request is 4096. However for a filesystem with 1KB block size, the maximum size of the block buddies used by the multiblock allocator is 2048, so none of groups in the filesystem satisfies the search criteria in ext4_mb_good_group(). Scanning all the filesystem groups impacts performance. This was demonstrated by using a freshly created, 70GB, 1k block filesystem, with caches dropped write before the test via /proc/sys/vm/drop_caches, and with the filesystem mounted with nodelalloc and nodealloc,nomballoc. The time to write an 8 megabyte file using "dd if=/dev/zero of=/mnt/test/fo bs=8k count=1k conv=fsync" took 35.5091 seconds (236kB/s) with nodellaloc, and 0.233754 seconds (35.9 MB/s) with the nodelloc,nomballoc options. With a 1TB partition, it took several minutes to write 8MB! This patch modifies the algorithm in ext4_mb_normalize_group_request to calculate the number of blocks to allocate by taking into account the maximum size of free blocks chunks handled by the multiblock allocator. It has also been tested for filesystems with 2KB and 4KB block sizes to ensure that those cases don't regress. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Valerie Clement <valerie.clement@bull.net> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
f36f21ec |
|
12-May-2008 |
Jean Delvare <khali@linux-fr.org> |
Fix misuses of bdevname() bdevname() fills the buffer that it is given as a parameter, so calling strcpy() or snprintf() on the returned value is redundant (and probably not guaranteed to work - I don't think strcpy and snprintf support overlapping buffers.) Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Stephen Tweedie <sct@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f1fa3342 |
|
29-Apr-2008 |
Roel Kluin <12o3l@tiscali.nl> |
ext4: fix hot spins in mballoc after err_freebuddy and err_freemeta In ext4_mb_init_backend() 'i' is of type ext4_group_t. Since unsigned, i >= 0 is always true, so fix hot spins after err_freebuddy: and -meta: and prevent decrements when zero. Signed-off-by: Roel Kluin <12o3l@tiscali.nl> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
8f6e39a7 |
|
29-Apr-2008 |
Mingming Cao <cmm@us.ibm.com> |
ext4: Move mballoc headers/structures to a seperate header file mballoc.h Move function and structure definiations out of mballoc.c and put it under a new header file mballoc.h Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
60bd63d1 |
|
29-Apr-2008 |
Solofo Ramangalahy <Solofo.Ramangalahy@bull.net> |
ext4: cleanup for compiling mballoc with verification and debugging #defines This patch allows compiling mballoc with: #define AGGRESSIVE_CHECK #define DOUBLE_CHECK #define MB_DEBUG It fixes: Compilation errors: fs/ext4/mballoc.c: In function '__mb_check_buddy': fs/ext4/mballoc.c:605: error: 'struct ext4_prealloc_space' has no member named 'group_list' fs/ext4/mballoc.c:606: error: 'struct ext4_prealloc_space' has no member named 'pstart' fs/ext4/mballoc.c:608: error: 'struct ext4_prealloc_space' has no member named 'len' Compilation warnings: fs/ext4/mballoc.c: In function 'ext4_mb_normalize_group_request': fs/ext4/mballoc.c:2863: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'int' fs/ext4/mballoc.c: In function 'ext4_mb_use_inode_pa': fs/ext4/mballoc.c:3103: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'int' Sparse check: fs/ext4/mballoc.c:3818:2: warning: context imbalance in 'ext4_mb_show_ac' - different lock contexts for basic block Signed-off-by: Solofo Ramangalahy <Solofo.Ramangalahy@bull.net> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
c83617db |
|
29-Apr-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Don't do GFP_NOFS allocations after taking ext4_lock_group We can't do GFP_NOFS allocation after taking ext4_lock_group BUG: sleeping function called from invalid context at mm/slab.c:3054 in_atomic():1, irqs_disabled():0 1 lock held by vi/2426: #0: (&ei->i_data_sem){----}, at: [<c01cf665>] ext4_release_file+0x23/0x66 Pid: 2426, comm: vi Not tainted 2.6.25-rc7 #24 [<c011a3dc>] __might_sleep+0xbe/0xc5 [<c01620c9>] kmem_cache_alloc+0x22/0xa6 [<c01e382a>] ext4_mb_release_inode_pa+0x73/0x1b3 [<c01e6adf>] ext4_mb_discard_inode_preallocations+0x22d/0x2d4 [<c013000a>] ? param_set_ushort+0x32/0x39 [<c01ceba1>] ext4_discard_reservation+0x27/0x6a [<c01cf66c>] ext4_release_file+0x2a/0x66 [<c0165bd6>] __fput+0xae/0x155 [<c0165e46>] fput+0x17/0x19 [<c0163756>] filp_close+0x50/0x5a [<c01647c0>] sys_close+0x71/0xad [<c0104aba>] sysenter_past_esp+0x5f/0xa5 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
3dcf5451 |
|
29-Apr-2008 |
Christoph Hellwig <hch@lst.de> |
ext4: move headers out of include/linux Move ext4 headers out of include/linux. This is just the trivial move, there's some more thing that could be done later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
46fe74f2 |
|
29-Apr-2008 |
Denis V. Lunev <den@openvz.org> |
ext4: use non-racy method for proc entries creation Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data be setup before gluing PDE to main tree. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: <linux-ext4@vger.kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
36a5aeb8 |
|
29-Apr-2008 |
Alexey Dobriyan <adobriyan@gmail.com> |
proc: remove proc_root_fs Use creation by full path instead: "fs/foo". Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
46e665e9 |
|
17-Apr-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
ext4: replace remaining __FUNCTION__ occurrences __FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
14499f35 |
|
17-Apr-2008 |
Mingming Cao <cmm@us.ibm.com> |
ext4: remove extra define of ext4_new_blocks_old from mballoc.c The function prototype of ext4_new_blocks_old() is defined in ext4_fs.h, so we don't need the extra function prototype in mballoc.c Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e8546d06 |
|
17-Apr-2008 |
Marcin Slusarz <marcin.slusarz@gmail.com> |
ext4: le*_add_cpu conversion replace all: little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) + expression_in_cpu_byteorder); with: leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder); generated with semantic patch Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: sct@redhat.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: adilger@clusterfs.com Cc: Mingming Cao <cmm@us.ibm.com>
|
#
9a0762c5 |
|
17-Apr-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Convert list_for_each_rcu() to list_for_each_entry_rcu() The list_for_each_entry_rcu() primitive should be used instead of list_for_each_rcu(), as the former is easier to use and provides better type safety. http://groups.google.com/group/linux.kernel/browse_thread/thread/45749c83451cebeb/0633a65759ce7713?lnk=raot Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Roel Kluin <12o3l@tiscali.nl> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
4ddfef7b |
|
29-Apr-2008 |
Eric Sandeen <sandeen@redhat.com> |
ext4: reduce mballoc stack usage with noinline_for_stack mballoc.c is a whole lot of static functions, which gcc seems to really like to inline. With the changes below, on x86, I can at least get from: 432 ext4_mb_new_blocks 240 ext4_mb_free_blocks 208 ext4_mb_discard_group_preallocations 188 ext4_mb_seq_groups_show 164 ext4_mb_init_cache 152 ext4_mb_release_inode_pa 136 ext4_mb_seq_history_show ... to 220 ext4_mb_free_blocks 188 ext4_mb_seq_groups_show 176 ext4_mb_regular_allocator 164 ext4_mb_init_cache 156 ext4_mb_new_blocks 152 ext4_mb_release_inode_pa 136 ext4_mb_seq_history_show 124 ext4_mb_release_group_pa ... which still has some big functions in there, but not 432 bytes! Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
ffad0a44 |
|
22-Feb-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: ext4_find_next_zero_bit needs an aligned address on some arch ext4_find_next_zero_bit and ext4_find_next_bit needs a long aligned address on x8_64. Add mb_find_next_zero_bit and mb_find_next_bit and use them in the mballoc. Fix: https://bugzilla.redhat.com/show_bug.cgi?id=433286 Eric Sandeen debugged the problem and suggested the fix. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
e56eb659 |
|
15-Feb-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Don't claim block from group which has corrupt bitmap In ext4_mb_complex_scan_group, if the extent length of the newly found extentet is greater than than the total free blocks counted in group info, break without claiming the block. Document different ext4_error usage, explaining the state with which we continue if we mount with errors=continue Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
b73fce69 |
|
15-Feb-2008 |
Valerie Clement <valerie.clement@bull.net> |
ext4: Fix kernel BUG at fs/ext4/mballoc.c:910! With the flex_bg feature enabled, a large file creation oopses the kernel. The BUG_ON is: BUG_ON(len >= EXT4_BLOCKS_PER_GROUP(sb)); As the allocation of the bitmaps and the inode table can be done outside the block group with flex_bg, this allows to allocate up to EXT4_BLOCKS_PER_GROUP blocks in a group. This patch fixes the oops. Signed-off-by: Valerie Clement <valerie.clement@bull.net> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
26346ff6 |
|
09-Feb-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Don't panic in case of corrupt bitmap Multiblock allocator calls BUG_ON in many case if the free and used blocks count obtained looking at the bitmap is different from what the allocator internally accounted for. Use ext4_error in such case and don't panic the system. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
256bdb49 |
|
09-Feb-2008 |
Eric Sandeen <sandeen@redhat.com> |
ext4: allocate struct ext4_allocation_context from a kmem cache struct ext4_allocation_context is rather large, and this bloats the stack of many functions which use it. Allocating it from a named slab cache will alleviate this. For example, with this change (on top of the noinline patch sent earlier): -ext4_mb_new_blocks 200 +ext4_mb_new_blocks 40 -ext4_mb_free_blocks 344 +ext4_mb_free_blocks 168 -ext4_mb_release_inode_pa 216 +ext4_mb_release_inode_pa 40 -ext4_mb_release_group_pa 192 +ext4_mb_release_group_pa 24 Most of these stack-allocated structs are actually used only for mballoc history; and in those cases often a smaller struct would do. So changing that may be another way around it, at least for those functions, if preferred. For now, in those cases where the ac is only for history, an allocation failure simply skips the history recording, and does not cause any other failures. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
#
42a10add |
|
09-Feb-2008 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
ext4: Fix null bh pointer dereference in mballoc Repoted by Adrian Bunk <bunk@kernel.org>: The Coverity checker spotted the following NULL dereference: static int ext4_mb_mark_diskspace_used { ... if (!bitmap_bh) goto out_err; ... out_err: sb->s_dirt = 1; put_bh(bitmap_bh); ... Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
|
#
c9de560d |
|
28-Jan-2008 |
Alex Tomas <alex@clusterfs.com> |
ext4: Add multi block allocator for ext4 Signed-off-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|