Searched +hist:8174202 +hist:b (Results 1 - 25 of 26) sorted by path
/linux-master/fs/9p/ | ||
H A D | vfs_file.c | diff 24e42e32 Wed Nov 18 02:06:42 MST 2020 David Howells <dhowells@redhat.com> 9p: Use fscache indexing rewrite and reenable caching Change the 9p filesystem to take account of the changes to fscache's indexing rewrite and reenable caching in 9p. The following changes have been made: (1) The fscache_netfs struct is no more, and there's no need to register the filesystem as a whole. (2) The session cookie is now an fscache_volume cookie, allocated with fscache_acquire_volume(). That takes three parameters: a string representing the "volume" in the index, a string naming the cache to use (or NULL) and a u64 that conveys coherency metadata for the volume. For 9p, I've made it render the volume name string as: "9p,<devname>,<cachetag>" where the cachetag is replaced by the aname if it wasn't supplied. This probably needs rethinking a bit as the aname can have slashes in it. It might be better to hash the cachetag and use the hash or I could substitute commas for the slashes or something. (3) The fscache_cookie_def is no more and needed information is passed directly to fscache_acquire_cookie(). The cache no longer calls back into the filesystem, but rather metadata changes are indicated at other times. fscache_acquire_cookie() is passed the same keying and coherency information as before. (4) The functions to set/reset/flush cookies are removed and fscache_use_cookie() and fscache_unuse_cookie() are used instead. fscache_use_cookie() is passed a flag to indicate if the cookie is opened for writing. fscache_unuse_cookie() is passed updates for the metadata if we changed it (ie. if the file was opened for writing). These are called when the file is opened or closed. (5) wait_on_page_bit[_killable]() is replaced with the specific wait functions for the bits waited upon. (6) I've got rid of some of the 9p-specific cache helper functions and called things like fscache_relinquish_cookie() directly as they'll optimise away if v9fs_inode_cookie() returns an unconditional NULL (which will be the case if CONFIG_9P_FSCACHE=n). (7) v9fs_vfs_setattr() is made to call fscache_resize() to change the size of the cache object. Notes: (A) We should call fscache_invalidate() if we detect that the server's copy of a file got changed by a third party, but I don't know where to do that. We don't need to do that when allocating the cookie as we get a check-and-invalidate when we initially bind to the cache object. (B) The copy-to-cache-on-writeback side of things will be handled in separate patch. Changes ======= ver #3: - Canonicalise the cookie key and coherency data to make them endianness-independent. ver #2: - Use gfpflags_allow_blocking() rather than using flag directly. - fscache_acquire_volume() now returns errors. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dominique Martinet <asmadeus@codewreck.org> cc: Eric Van Hensbergen <ericvh@gmail.com> cc: Latchesar Ionkov <lucho@ionkov.net> cc: v9fs-developer@lists.sourceforge.net cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819664645.215744.1555314582005286846.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906975017.143852.3459573173204394039.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967178512.1823006.17377493641569138183.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021573143.640689.3977487095697717967.stgit@warthog.procyon.org.uk/ # v4 diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 0b173bc4 Mon Oct 08 17:28:46 MDT 2012 Konstantin Khlebnikov <khlebnikov@openvz.org> mm: kill vma flag VM_CAN_NONLINEAR Move actual pte filling for non-linear file mappings into the new special vma operation: ->remap_pages(). Filesystems must implement this method to get non-linear mapping support, if it uses filemap_fault() then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 733482e4 Tue Nov 08 22:34:55 MST 2005 Olaf Hering <olh@suse.de> [PATCH] changing CONFIG_LOCALVERSION rebuilds too much, for no good reason This patch removes almost all inclusions of linux/version.h. The 3 #defines are unused in most of the touched files. A few drivers use the simple KERNEL_VERSION(a,b,c) macro, which is unfortunatly in linux/version.h. There are also lots of #ifdef for long obsolete kernels, this was not touched. In a few places, the linux/version.h include was move to where the LINUX_VERSION_CODE was used. quilt vi `find * -type f -name "*.[ch]"|xargs grep -El '(UTS_RELEASE|LINUX_VERSION_CODE|KERNEL_VERSION|linux/version.h)'|grep -Ev '(/(boot|coda|drm)/|~$)'` search pattern: /UTS_RELEASE\|LINUX_VERSION_CODE\|KERNEL_VERSION\|linux\/\(utsname\|version\).h Signed-off-by: Olaf Hering <olh@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> |
/linux-master/fs/adfs/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/affs/ | ||
H A D | file.c | diff 7633978b Fri Dec 12 17:57:47 MST 2014 Fabian Frederick <fabf@skynet.be> fs/affs/file.c: forward declaration clean-up -Move file_operations to avoid forward declarations. -Remove unused declarations. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/bfs/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/ecryptfs/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/ext2/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/f2fs/ | ||
H A D | file.c | diff c7115e09 Fri Jan 12 12:41:32 MST 2024 Chao Yu <chao@kernel.org> f2fs: introduce FAULT_BLKADDR_CONSISTENCE We will encounter below inconsistent status when FAULT_BLKADDR type fault injection is on. Info: checkpoint state = d6 : nat_bits crc fsck compacted_summary orphan_inodes sudden-power-off [ASSERT] (fsck_chk_inode_blk:1254) --> ino: 0x1c100 has i_blocks: 000000c0, but has 191 blocks [FIX] (fsck_chk_inode_blk:1260) --> [0x1c100] i_blocks=0x000000c0 -> 0xbf [FIX] (fsck_chk_inode_blk:1269) --> [0x1c100] i_compr_blocks=0x00000026 -> 0x27 [ASSERT] (fsck_chk_inode_blk:1254) --> ino: 0x1cadb has i_blocks: 0000002f, but has 46 blocks [FIX] (fsck_chk_inode_blk:1260) --> [0x1cadb] i_blocks=0x0000002f -> 0x2e [FIX] (fsck_chk_inode_blk:1269) --> [0x1cadb] i_compr_blocks=0x00000011 -> 0x12 [ASSERT] (fsck_chk_inode_blk:1254) --> ino: 0x1c62c has i_blocks: 00000002, but has 1 blocks [FIX] (fsck_chk_inode_blk:1260) --> [0x1c62c] i_blocks=0x00000002 -> 0x1 After we inject fault into f2fs_is_valid_blkaddr() during truncation, a) it missed to increase @nr_free or @valid_blocks b) it can cause in blkaddr leak in truncated dnode Which may cause inconsistent status. This patch separates FAULT_BLKADDR_CONSISTENCE from FAULT_BLKADDR, and rename FAULT_BLKADDR to FAULT_BLKADDR_VALIDITY so that we can: a) use FAULT_BLKADDR_CONSISTENCE in f2fs_truncate_data_blocks_range() to simulate inconsistent issue independently, then it can verify fsck repair flow. b) FAULT_BLKADDR_VALIDITY fault will not cause any inconsistent status, we can just use it to check error path handling in kernel side. Reviewed-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff c7115e09 Fri Jan 12 12:41:32 MST 2024 Chao Yu <chao@kernel.org> f2fs: introduce FAULT_BLKADDR_CONSISTENCE We will encounter below inconsistent status when FAULT_BLKADDR type fault injection is on. Info: checkpoint state = d6 : nat_bits crc fsck compacted_summary orphan_inodes sudden-power-off [ASSERT] (fsck_chk_inode_blk:1254) --> ino: 0x1c100 has i_blocks: 000000c0, but has 191 blocks [FIX] (fsck_chk_inode_blk:1260) --> [0x1c100] i_blocks=0x000000c0 -> 0xbf [FIX] (fsck_chk_inode_blk:1269) --> [0x1c100] i_compr_blocks=0x00000026 -> 0x27 [ASSERT] (fsck_chk_inode_blk:1254) --> ino: 0x1cadb has i_blocks: 0000002f, but has 46 blocks [FIX] (fsck_chk_inode_blk:1260) --> [0x1cadb] i_blocks=0x0000002f -> 0x2e [FIX] (fsck_chk_inode_blk:1269) --> [0x1cadb] i_compr_blocks=0x00000011 -> 0x12 [ASSERT] (fsck_chk_inode_blk:1254) --> ino: 0x1c62c has i_blocks: 00000002, but has 1 blocks [FIX] (fsck_chk_inode_blk:1260) --> [0x1c62c] i_blocks=0x00000002 -> 0x1 After we inject fault into f2fs_is_valid_blkaddr() during truncation, a) it missed to increase @nr_free or @valid_blocks b) it can cause in blkaddr leak in truncated dnode Which may cause inconsistent status. This patch separates FAULT_BLKADDR_CONSISTENCE from FAULT_BLKADDR, and rename FAULT_BLKADDR to FAULT_BLKADDR_VALIDITY so that we can: a) use FAULT_BLKADDR_CONSISTENCE in f2fs_truncate_data_blocks_range() to simulate inconsistent issue independently, then it can verify fsck repair flow. b) FAULT_BLKADDR_VALIDITY fault will not cause any inconsistent status, we can just use it to check error path handling in kernel side. Reviewed-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff b5ab3276 Wed Jul 05 20:06:14 MDT 2023 Chao Yu <chao@kernel.org> f2fs: fix to avoid mmap vs set_compress_option case Compression option in inode should not be changed after they have been used, however, it may happen in below race case: Thread A Thread B - f2fs_ioc_set_compress_option - check f2fs_is_mmap_file() - check get_dirty_pages() - check F2FS_HAS_BLOCKS() - f2fs_file_mmap - set_inode_flag(FI_MMAP_FILE) - fault - do_page_mkwrite - f2fs_vm_page_mkwrite - f2fs_get_block_locked - fault_dirty_shared_page - set_page_dirty - update i_compress_algorithm - update i_log_cluster_size - update i_cluster_size Avoid such race condition by covering f2fs_file_mmap() w/ i_sem lock, meanwhile add mmap file check condition in f2fs_may_compress() as well. Fixes: e1e8debec656 ("f2fs: add F2FS_IOC_SET_COMPRESS_OPTION ioctl") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff d8189834 Tue May 23 00:17:25 MDT 2023 Chao Yu <chao@kernel.org> f2fs: fix to avoid NULL pointer dereference f2fs_write_end_io() butt3rflyh4ck reports a bug as below: When a thread always calls F2FS_IOC_RESIZE_FS to resize fs, if resize fs is failed, f2fs kernel thread would invoke callback function to update f2fs io info, it would call f2fs_write_end_io and may trigger null-ptr-deref in NODE_MAPPING. general protection fault, probably for non-canonical address KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] RIP: 0010:NODE_MAPPING fs/f2fs/f2fs.h:1972 [inline] RIP: 0010:f2fs_write_end_io+0x727/0x1050 fs/f2fs/data.c:370 <TASK> bio_endio+0x5af/0x6c0 block/bio.c:1608 req_bio_endio block/blk-mq.c:761 [inline] blk_update_request+0x5cc/0x1690 block/blk-mq.c:906 blk_mq_end_request+0x59/0x4c0 block/blk-mq.c:1023 lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1101 __do_softirq+0x1d4/0x8ef kernel/softirq.c:571 run_ksoftirqd kernel/softirq.c:939 [inline] run_ksoftirqd+0x31/0x60 kernel/softirq.c:931 smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164 kthread+0x33e/0x440 kernel/kthread.c:379 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 The root cause is below race case can cause leaving dirty metadata in f2fs after filesystem is remount as ro: Thread A Thread B - f2fs_ioc_resize_fs - f2fs_readonly --- return false - f2fs_resize_fs - f2fs_remount - write_checkpoint - set f2fs as ro - free_segment_range - update meta_inode's data Then, if f2fs_put_super() fails to write_checkpoint due to readonly status, and meta_inode's dirty data will be writebacked after node_inode is put, finally, f2fs_write_end_io will access NULL pointer on sbi->node_inode. Thread A IRQ context - f2fs_put_super - write_checkpoint fails - iput(node_inode) - node_inode = NULL - iput(meta_inode) - write_inode_now - f2fs_write_meta_page - f2fs_write_end_io - NODE_MAPPING(sbi) : access NULL pointer on node_inode Fixes: b4b10061ef98 ("f2fs: refactor resize_fs to avoid meta updates in progress") Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Closes: https://lore.kernel.org/r/1684480657-2375-1-git-send-email-yangtiezhu@loongson.cn Tested-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff 92318f20 Mon Feb 20 05:20:04 MST 2023 Hans Holmberg <hans.holmberg@wdc.com> f2fs: preserve direct write semantics when buffering is forced In some cases, e.g. for zoned block devices, direct writes are forced into buffered writes that will populate the page cache and be written out just like buffered io. Direct reads, on the other hand, is supported for the zoned block device case. This has the effect that applications built for direct io will fill up the page cache with data that will never be read, and that is a waste of resources. If we agree that this is a problem, how do we fix it? A) Supporting proper direct writes for zoned block devices would be the best, but it is currently not supported (probably for a good but non-obvious reason). Would it be feasible to implement proper direct IO? B) Avoid the cost of keeping unwanted data by syncing and throwing out the cached pages for buffered O_DIRECT writes before completion. This patch implements B) by reusing the code for how partial block writes are flushed out on the "normal" direct write path. Note that this changes the performance characteristics of f2fs quite a bit. Direct IO performance for zoned block devices is lower for small writes after this patch, but this should be expected with direct IO and in line with how f2fs behaves on top of conventional block devices. Another open question is if the flushing should be done for all cases where buffered writes are forced. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Yonggil Song <yonggil.song@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff 92318f20 Mon Feb 20 05:20:04 MST 2023 Hans Holmberg <hans.holmberg@wdc.com> f2fs: preserve direct write semantics when buffering is forced In some cases, e.g. for zoned block devices, direct writes are forced into buffered writes that will populate the page cache and be written out just like buffered io. Direct reads, on the other hand, is supported for the zoned block device case. This has the effect that applications built for direct io will fill up the page cache with data that will never be read, and that is a waste of resources. If we agree that this is a problem, how do we fix it? A) Supporting proper direct writes for zoned block devices would be the best, but it is currently not supported (probably for a good but non-obvious reason). Would it be feasible to implement proper direct IO? B) Avoid the cost of keeping unwanted data by syncing and throwing out the cached pages for buffered O_DIRECT writes before completion. This patch implements B) by reusing the code for how partial block writes are flushed out on the "normal" direct write path. Note that this changes the performance characteristics of f2fs quite a bit. Direct IO performance for zoned block devices is lower for small writes after this patch, but this should be expected with direct IO and in line with how f2fs behaves on top of conventional block devices. Another open question is if the flushing should be done for all cases where buffered writes are forced. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Yonggil Song <yonggil.song@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff 44abff2c Thu Apr 14 22:52:57 MDT 2022 Christoph Hellwig <hch@lst.de> block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD Secure erase is a very different operation from discard in that it is a data integrity operation vs hint. Fully split the limits and helper infrastructure to make the separation more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2] Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> diff 7b47ef52 Thu Apr 14 22:52:56 MDT 2022 Christoph Hellwig <hch@lst.de> block: add a bdev_discard_granularity helper Abstract away implementation details from file systems by providing a block_device based helper to retrieve the discard granularity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Link: https://lore.kernel.org/r/20220415045258.199825-26-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> diff 984fc4e7 Thu Feb 03 22:24:56 MST 2022 Chao Yu <chao@kernel.org> f2fs: support idmapped mounts This patch enables idmapped mounts for f2fs, since all dedicated helpers for this functionality existsm, so, in this patch we just pass down the user_namespace argument from the VFS methods to the relevant helpers. Simple idmap example on f2fs image: 1. truncate -s 128M f2fs.img 2. mkfs.f2fs f2fs.img 3. mount f2fs.img /mnt/f2fs/ 4. touch /mnt/f2fs/file 5. ls -ln /mnt/f2fs/ total 0 -rw-r--r-- 1 0 0 0 2月 4 13:17 file 6. ./mount-idmapped --map-mount b:0:1001:1 /mnt/f2fs/ /mnt/scratch_f2fs/ 7. ls -ln /mnt/scratch_f2fs/ total 0 -rw-r--r-- 1 1001 1001 0 2月 4 13:17 file Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff 1018a546 Fri Feb 04 00:19:46 MST 2022 Chao Yu <chao@kernel.org> f2fs: introduce F2FS_IPU_HONOR_OPU_WRITE ipu policy Once F2FS_IPU_FORCE policy is enabled in some cases: a) f2fs forces to use F2FS_IPU_FORCE in a small-sized volume b) user sets F2FS_IPU_FORCE policy via sysfs Then we may fail to defragment file due to IPU policy check, it doesn't make sense, let's introduce a new IPU policy to allow OPU during file defragmentation. In small-sized volume, let's enable F2FS_IPU_HONOR_OPU_WRITE policy by default. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> |
/linux-master/fs/fat/ | ||
H A D | file.c | diff 7b47ef52 Thu Apr 14 22:52:56 MDT 2022 Christoph Hellwig <hch@lst.de> block: add a bdev_discard_granularity helper Abstract away implementation details from file systems by providing a block_device based helper to retrieve the discard granularity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Link: https://lore.kernel.org/r/20220415045258.199825-26-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> diff 70200574 Thu Apr 14 22:52:55 MDT 2022 Christoph Hellwig <hch@lst.de> block: remove QUEUE_FLAG_DISCARD Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> diff 4b789936 Thu Jan 21 06:19:56 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> fat: handle idmapped mounts Let fat handle idmapped mounts. This allows to have the same fat mount appear in multiple locations with different id mappings. This allows to expose a vfat formatted USB stick to multiple user with different ids on the host or in user namespaces allowing for dac permissions: mount -o uid=1000,gid=1000 /dev/sdb /mnt u1001@f2-vm:/lower1$ ls -ln /mnt/ total 4 -rwxr-xr-x 1 1000 1000 4 Oct 28 03:44 aaa -rwxr-xr-x 1 1000 1000 0 Oct 28 01:09 bbb -rwxr-xr-x 1 1000 1000 0 Oct 28 01:10 ccc -rwxr-xr-x 1 1000 1000 0 Oct 28 03:46 ddd -rwxr-xr-x 1 1000 1000 0 Oct 28 04:01 eee mount-idmapped --map-mount b:1000:1001:1 u1001@f2-vm:/lower1$ ls -ln /lower1/ total 4 -rwxr-xr-x 1 1001 1001 4 Oct 28 03:44 aaa -rwxr-xr-x 1 1001 1001 0 Oct 28 01:09 bbb -rwxr-xr-x 1 1001 1001 0 Oct 28 01:10 ccc -rwxr-xr-x 1 1001 1001 0 Oct 28 03:46 ddd -rwxr-xr-x 1 1001 1001 0 Oct 28 04:01 eee u1001@f2-vm:/lower1$ touch /lower1/fff u1001@f2-vm:/lower1$ ls -ln /lower1/fff -rwxr-xr-x 1 1001 1001 0 Oct 28 04:03 /lower1/fff u1001@f2-vm:/lower1$ ls -ln /mnt/fff -rwxr-xr-x 1 1000 1000 0 Oct 28 04:03 /mnt/fff Link: https://lore.kernel.org/r/20210121131959.646623-38-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> diff b13bb33e Wed Jan 20 15:59:41 MST 2016 Namjae Jeon <namjae.jeon@samsung.com> fat: add fat_fallocate operation Implement preallocation via the fallocate syscall on VFAT partitions. This patch is based on an earlier patch of the same name which had some issues detailed below and did not get accepted. Refer https://lkml.org/lkml/2007/12/22/130. a) The preallocated space was not persistent when the FALLOC_FL_KEEP_SIZE flag was set. It will deallocate cluster at evict time. b) There was no need to zero out the clusters when the flag was set Instead of doing an expanding truncate, just allocate clusters and add them to the fat chain. This reduces preallocation time. Compatibility with windows: There are no issues when FALLOC_FL_KEEP_SIZE is not set because it just does an expanding truncate. Thus reading from the preallocated area on windows returns null until data is written to it. When a file with preallocated area using the FALLOC_FL_KEEP_SIZE was written to on windows, the windows driver freed-up the preallocated clusters and allocated new clusters for the new data. The freed up clusters gets reflected in the free space available for the partition which can be seen from the Volume properties. The windows chkdsk tool also does not report any errors on a disk containing files with preallocated space. And there is also no issue using linux fat fsck. because discard preallocated clusters at repair time. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/hfs/ | ||
H A D | inode.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff e8edc6e0 Sun May 20 15:22:52 MDT 2007 Alexey Dobriyan <adobriyan@gmail.com> Detach sched.h from mm.h First thing mm.h does is including sched.h solely for can_do_mlock() inline function which has "current" dereference inside. By dealing with can_do_mlock() mm.h can be detached from sched.h which is good. See below, why. This patch a) removes unconditional inclusion of sched.h from mm.h b) makes can_do_mlock() normal function in mm/mlock.c c) exports can_do_mlock() to not break compilation d) adds sched.h inclusions back to files that were getting it indirectly. e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were getting them indirectly Net result is: a) mm.h users would get less code to open, read, preprocess, parse, ... if they don't need sched.h b) sched.h stops being dependency for significant number of files: on x86_64 allmodconfig touching sched.h results in recompile of 4083 files, after patch it's only 3744 (-8.3%). Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e8edc6e0 Sun May 20 15:22:52 MDT 2007 Alexey Dobriyan <adobriyan@gmail.com> Detach sched.h from mm.h First thing mm.h does is including sched.h solely for can_do_mlock() inline function which has "current" dereference inside. By dealing with can_do_mlock() mm.h can be detached from sched.h which is good. See below, why. This patch a) removes unconditional inclusion of sched.h from mm.h b) makes can_do_mlock() normal function in mm/mlock.c c) exports can_do_mlock() to not break compilation d) adds sched.h inclusions back to files that were getting it indirectly. e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were getting them indirectly Net result is: a) mm.h users would get less code to open, read, preprocess, parse, ... if they don't need sched.h b) sched.h stops being dependency for significant number of files: on x86_64 allmodconfig touching sched.h results in recompile of 4083 files, after patch it's only 3744 (-8.3%). Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 733482e4 Tue Nov 08 22:34:55 MST 2005 Olaf Hering <olh@suse.de> [PATCH] changing CONFIG_LOCALVERSION rebuilds too much, for no good reason This patch removes almost all inclusions of linux/version.h. The 3 #defines are unused in most of the touched files. A few drivers use the simple KERNEL_VERSION(a,b,c) macro, which is unfortunatly in linux/version.h. There are also lots of #ifdef for long obsolete kernels, this was not touched. In a few places, the linux/version.h include was move to where the LINUX_VERSION_CODE was used. quilt vi `find * -type f -name "*.[ch]"|xargs grep -El '(UTS_RELEASE|LINUX_VERSION_CODE|KERNEL_VERSION|linux/version.h)'|grep -Ev '(/(boot|coda|drm)/|~$)'` search pattern: /UTS_RELEASE\|LINUX_VERSION_CODE\|KERNEL_VERSION\|linux\/\(utsname\|version\).h Signed-off-by: Olaf Hering <olh@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> |
/linux-master/fs/hfsplus/ | ||
H A D | inode.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff e8edc6e0 Sun May 20 15:22:52 MDT 2007 Alexey Dobriyan <adobriyan@gmail.com> Detach sched.h from mm.h First thing mm.h does is including sched.h solely for can_do_mlock() inline function which has "current" dereference inside. By dealing with can_do_mlock() mm.h can be detached from sched.h which is good. See below, why. This patch a) removes unconditional inclusion of sched.h from mm.h b) makes can_do_mlock() normal function in mm/mlock.c c) exports can_do_mlock() to not break compilation d) adds sched.h inclusions back to files that were getting it indirectly. e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were getting them indirectly Net result is: a) mm.h users would get less code to open, read, preprocess, parse, ... if they don't need sched.h b) sched.h stops being dependency for significant number of files: on x86_64 allmodconfig touching sched.h results in recompile of 4083 files, after patch it's only 3744 (-8.3%). Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e8edc6e0 Sun May 20 15:22:52 MDT 2007 Alexey Dobriyan <adobriyan@gmail.com> Detach sched.h from mm.h First thing mm.h does is including sched.h solely for can_do_mlock() inline function which has "current" dereference inside. By dealing with can_do_mlock() mm.h can be detached from sched.h which is good. See below, why. This patch a) removes unconditional inclusion of sched.h from mm.h b) makes can_do_mlock() normal function in mm/mlock.c c) exports can_do_mlock() to not break compilation d) adds sched.h inclusions back to files that were getting it indirectly. e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were getting them indirectly Net result is: a) mm.h users would get less code to open, read, preprocess, parse, ... if they don't need sched.h b) sched.h stops being dependency for significant number of files: on x86_64 allmodconfig touching sched.h results in recompile of 4083 files, after patch it's only 3744 (-8.3%). Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 733482e4 Tue Nov 08 22:34:55 MST 2005 Olaf Hering <olh@suse.de> [PATCH] changing CONFIG_LOCALVERSION rebuilds too much, for no good reason This patch removes almost all inclusions of linux/version.h. The 3 #defines are unused in most of the touched files. A few drivers use the simple KERNEL_VERSION(a,b,c) macro, which is unfortunatly in linux/version.h. There are also lots of #ifdef for long obsolete kernels, this was not touched. In a few places, the linux/version.h include was move to where the LINUX_VERSION_CODE was used. quilt vi `find * -type f -name "*.[ch]"|xargs grep -El '(UTS_RELEASE|LINUX_VERSION_CODE|KERNEL_VERSION|linux/version.h)'|grep -Ev '(/(boot|coda|drm)/|~$)'` search pattern: /UTS_RELEASE\|LINUX_VERSION_CODE\|KERNEL_VERSION\|linux\/\(utsname\|version\).h Signed-off-by: Olaf Hering <olh@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> |
/linux-master/fs/hostfs/ | ||
H A D | hostfs_kern.c | diff 2773bf00 Tue Sep 27 03:03:58 MDT 2016 Miklos Szeredi <mszeredi@redhat.com> fs: rename "rename2" i_op to "rename" Generated patch: sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2` sed -i "s/\brename2\b/rename/g" `git grep -wl rename2` Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> diff 680baacb Sat May 02 11:32:22 MDT 2015 Al Viro <viro@zeniv.linux.org.uk> new ->follow_link() and ->put_link() calling conventions a) instead of storing the symlink body (via nd_set_link()) and returning an opaque pointer later passed to ->put_link(), ->follow_link() _stores_ that opaque pointer (into void * passed by address by caller) and returns the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic symlinks) and pointer to symlink body for normal symlinks. Stored pointer is ignored in all cases except the last one. Storing NULL for opaque pointer (or not storing it at all) means no call of ->put_link(). b) the body used to be passed to ->put_link() implicitly (via nameidata). Now only the opaque pointer is. In the cases when we used the symlink body to free stuff, ->follow_link() now should store it as opaque pointer in addition to returning it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 733482e4 Tue Nov 08 22:34:55 MST 2005 Olaf Hering <olh@suse.de> [PATCH] changing CONFIG_LOCALVERSION rebuilds too much, for no good reason This patch removes almost all inclusions of linux/version.h. The 3 #defines are unused in most of the touched files. A few drivers use the simple KERNEL_VERSION(a,b,c) macro, which is unfortunatly in linux/version.h. There are also lots of #ifdef for long obsolete kernels, this was not touched. In a few places, the linux/version.h include was move to where the LINUX_VERSION_CODE was used. quilt vi `find * -type f -name "*.[ch]"|xargs grep -El '(UTS_RELEASE|LINUX_VERSION_CODE|KERNEL_VERSION|linux/version.h)'|grep -Ev '(/(boot|coda|drm)/|~$)'` search pattern: /UTS_RELEASE\|LINUX_VERSION_CODE\|KERNEL_VERSION\|linux\/\(utsname\|version\).h Signed-off-by: Olaf Hering <olh@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> |
/linux-master/fs/hpfs/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/jffs2/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 5ffd3412 Wed Oct 17 14:59:30 MDT 2012 Thomas Betker <thomas.betker@freenet.de> jffs2: Fix lock acquisition order bug in jffs2_write_begin jffs2_write_begin() first acquires the page lock, then f->sem. This causes an AB-BA deadlock with jffs2_garbage_collect_live(), which first acquires f->sem, then the page lock: jffs2_garbage_collect_live mutex_lock(&f->sem) (A) jffs2_garbage_collect_dnode jffs2_gc_fetch_page read_cache_page_async do_read_cache_page lock_page(page) (B) jffs2_write_begin grab_cache_page_write_begin find_lock_page lock_page(page) (B) mutex_lock(&f->sem) (A) We fix this by restructuring jffs2_write_begin() to take f->sem before the page lock. However, we make sure that f->sem is not held when calling jffs2_reserve_space(), as this is not permitted by the locking rules. The deadlock above was observed multiple times on an SoC with a dual ARMv7 (Cortex-A9), running the long-term 3.4.11 kernel; it occurred when using scp to copy files from a host system to the ARM target system. The fix was heavily tested on the same target system. Cc: stable@vger.kernel.org Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com> Acked-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> diff 5ffd3412 Wed Oct 17 14:59:30 MDT 2012 Thomas Betker <thomas.betker@freenet.de> jffs2: Fix lock acquisition order bug in jffs2_write_begin jffs2_write_begin() first acquires the page lock, then f->sem. This causes an AB-BA deadlock with jffs2_garbage_collect_live(), which first acquires f->sem, then the page lock: jffs2_garbage_collect_live mutex_lock(&f->sem) (A) jffs2_garbage_collect_dnode jffs2_gc_fetch_page read_cache_page_async do_read_cache_page lock_page(page) (B) jffs2_write_begin grab_cache_page_write_begin find_lock_page lock_page(page) (B) mutex_lock(&f->sem) (A) We fix this by restructuring jffs2_write_begin() to take f->sem before the page lock. However, we make sure that f->sem is not held when calling jffs2_reserve_space(), as this is not permitted by the locking rules. The deadlock above was observed multiple times on an SoC with a dual ARMv7 (Cortex-A9), running the long-term 3.4.11 kernel; it occurred when using scp to copy files from a host system to the ARM target system. The fix was heavily tested on the same target system. Cc: stable@vger.kernel.org Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com> Acked-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> diff 0533400b Mon Jul 07 06:45:59 MDT 2008 Stoyan Gaydarov <stoyboyker@gmail.com> [JFFS2] Use .unlocked_ioctl This changes the .ioctl to the .unlocked_ioctl version. Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> |
/linux-master/fs/jfs/ | ||
H A D | file.c | diff 0ae45f63 Sun Feb 01 22:37:00 MST 2015 Theodore Ts'o <tytso@mit.edu> vfs: add support for a lazytime mount option Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call. For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact long tail latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/minix/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/nilfs2/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 0b173bc4 Mon Oct 08 17:28:46 MDT 2012 Konstantin Khlebnikov <khlebnikov@openvz.org> mm: kill vma flag VM_CAN_NONLINEAR Move actual pte filling for non-linear file mappings into the new special vma operation: ->remap_pages(). Filesystems must implement this method to get non-linear mapping support, if it uses filemap_fault() then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e3154e97 Tue Mar 08 19:05:08 MST 2011 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: get rid of nilfs_sb_info structure This directly uses sb->s_fs_info to keep a nilfs filesystem object and fully removes the intermediate nilfs_sb_info structure. With this change, the hierarchy of on-memory structures of nilfs will be simplified as follows: Before: super_block -> nilfs_sb_info -> the_nilfs -> cptree --+-> nilfs_root (current file system) +-> nilfs_root (snapshot A) +-> nilfs_root (snapshot B) : -> nilfs_sc_info (log writer structure) After: super_block -> the_nilfs -> cptree --+-> nilfs_root (current file system) +-> nilfs_root (snapshot A) +-> nilfs_root (snapshot B) : -> nilfs_sc_info (log writer structure) The reason why we didn't design so from the beginning is because the initial shape also differed from the above. The early hierachy was composed of "per-mount-point" super_block -> nilfs_sb_info pairs and a shared nilfs object. On the kernel 2.6.37, it was changed to the current shape in order to unify super block instances into one per device, and this cleanup became applicable as the result. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> diff e3154e97 Tue Mar 08 19:05:08 MST 2011 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: get rid of nilfs_sb_info structure This directly uses sb->s_fs_info to keep a nilfs filesystem object and fully removes the intermediate nilfs_sb_info structure. With this change, the hierarchy of on-memory structures of nilfs will be simplified as follows: Before: super_block -> nilfs_sb_info -> the_nilfs -> cptree --+-> nilfs_root (current file system) +-> nilfs_root (snapshot A) +-> nilfs_root (snapshot B) : -> nilfs_sc_info (log writer structure) After: super_block -> the_nilfs -> cptree --+-> nilfs_root (current file system) +-> nilfs_root (snapshot A) +-> nilfs_root (snapshot B) : -> nilfs_sc_info (log writer structure) The reason why we didn't design so from the beginning is because the initial shape also differed from the above. The early hierachy was composed of "per-mount-point" super_block -> nilfs_sb_info pairs and a shared nilfs object. On the kernel 2.6.37, it was changed to the current shape in order to unify super block instances into one per device, and this cleanup became applicable as the result. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> |
/linux-master/fs/omfs/ | ||
H A D | file.c | diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/ramfs/ | ||
H A D | file-mmu.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
H A D | file-nommu.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 7e660872 Fri Jan 15 18:01:39 MST 2010 David Howells <dhowells@redhat.com> nommu: fix shared mmap after truncate shrinkage problems Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen over the end of a truncation. The problem is that ramfs_nommu_check_mappings() checks that the reduced file size against the VMA tree, but not the vm_region tree. The following sequence of events can cause the problem: fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600); ftruncate(fd, 32 * 1024); a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); munmap(a, 32 * 1024); ftruncate(fd, 16 * 1024); c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b' sees that the vm_region from 'a' is covering the region it wants and so shares it, pinning it in memory. Mapping 'a' then goes away and the file is truncated to the end of VMA 'b'. However, the region allocated by 'a' is still in effect, and has _not_ been reduced. Mapping 'c' is then created, and because there's a vm_region covering the desired region, get_unmapped_area() is _not_ called to repeat the check, and the mapping is granted, even though the pages from the latter half of the mapping have been discarded. However: d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'd' should work, and should end up sharing the region allocated by 'a'. To deal with this, we shrink the vm_region struct during the truncation, lest do_mmap_pgoff() take it as licence to share the full region automatically without calling the get_unmapped_area() file op again. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 7e660872 Fri Jan 15 18:01:39 MST 2010 David Howells <dhowells@redhat.com> nommu: fix shared mmap after truncate shrinkage problems Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen over the end of a truncation. The problem is that ramfs_nommu_check_mappings() checks that the reduced file size against the VMA tree, but not the vm_region tree. The following sequence of events can cause the problem: fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600); ftruncate(fd, 32 * 1024); a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); munmap(a, 32 * 1024); ftruncate(fd, 16 * 1024); c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b' sees that the vm_region from 'a' is covering the region it wants and so shares it, pinning it in memory. Mapping 'a' then goes away and the file is truncated to the end of VMA 'b'. However, the region allocated by 'a' is still in effect, and has _not_ been reduced. Mapping 'c' is then created, and because there's a vm_region covering the desired region, get_unmapped_area() is _not_ called to repeat the check, and the mapping is granted, even though the pages from the latter half of the mapping have been discarded. However: d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'd' should work, and should end up sharing the region allocated by 'a'. To deal with this, we shrink the vm_region struct during the truncation, lest do_mmap_pgoff() take it as licence to share the full region automatically without calling the get_unmapped_area() file op again. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 7e660872 Fri Jan 15 18:01:39 MST 2010 David Howells <dhowells@redhat.com> nommu: fix shared mmap after truncate shrinkage problems Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen over the end of a truncation. The problem is that ramfs_nommu_check_mappings() checks that the reduced file size against the VMA tree, but not the vm_region tree. The following sequence of events can cause the problem: fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600); ftruncate(fd, 32 * 1024); a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); munmap(a, 32 * 1024); ftruncate(fd, 16 * 1024); c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b' sees that the vm_region from 'a' is covering the region it wants and so shares it, pinning it in memory. Mapping 'a' then goes away and the file is truncated to the end of VMA 'b'. However, the region allocated by 'a' is still in effect, and has _not_ been reduced. Mapping 'c' is then created, and because there's a vm_region covering the desired region, get_unmapped_area() is _not_ called to repeat the check, and the mapping is granted, even though the pages from the latter half of the mapping have been discarded. However: d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'd' should work, and should end up sharing the region allocated by 'a'. To deal with this, we shrink the vm_region struct during the truncation, lest do_mmap_pgoff() take it as licence to share the full region automatically without calling the get_unmapped_area() file op again. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/fs/reiserfs/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 0e4f6a79 Sat Jul 03 14:18:57 MDT 2010 Al Viro <viro@zeniv.linux.org.uk> Fix reiserfs_file_release() a) count file openers correctly; i_count use was completely wrong b) use new mutex for exclusion between final close/open/truncate, to protect tailpacking logics. i_mutex use was wrong and resulted in deadlocks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/sysv/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/ufs/ | ||
H A D | file.c | diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 8174202b Thu Apr 03 01:17:43 MDT 2014 Al Viro <viro@zeniv.linux.org.uk> write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/include/linux/ | ||
H A D | fs.h | diff e4e8b47a Mon Jan 15 10:30:46 MST 2018 Max Kellermann <mk@cm4all.com> fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n Make IS_POSIXACL() return false if POSIX ACL support is disabled. Never skip applying the umask in namei.c and never bother to do any ACL specific checks if the filesystem falsely indicates it has ACLs enabled when the feature is completely disabled in the kernel. This fixes a problem where the umask is always ignored in the NFS client when compiled without CONFIG_FS_POSIX_ACL. This is a 4 year old regression caused by commit 013cdf1088d723 which itself was not completely wrong, but failed to consider all the side effects by misdesigned VFS code. Prior to that commit, there were two places where the umask could be applied, for example when creating a directory: 1. in the VFS layer in SYSCALL_DEFINE3(mkdirat), but only if !IS_POSIXACL() 2. again (unconditionally) in nfs3_proc_mkdir() The first one does not apply, because even without CONFIG_FS_POSIX_ACL, the NFS client sets SB_POSIXACL in nfs_fill_super(). After that commit, (2.) was replaced by: 2b. in posix_acl_create(), called by nfs3_proc_mkdir() There's one branch in posix_acl_create() which applies the umask; however, without CONFIG_FS_POSIX_ACL, posix_acl_create() is an empty dummy function which does not apply the umask. The approach chosen by this patch is to make IS_POSIXACL() always return false when POSIX ACL support is disabled, so the umask always gets applied by the VFS layer. This is consistent with the (regular) behavior of posix_acl_create(): that function returns early if IS_POSIXACL() is false, before applying the umask. Therefore, posix_acl_create() is responsible for applying the umask if there is ACL support enabled in the file system (SB_POSIXACL), and the VFS layer is responsible for all other cases (no SB_POSIXACL or no CONFIG_FS_POSIX_ACL). Signed-off-by: Max Kellermann <mk@cm4all.com> Link: https://lore.kernel.org/r/151603744662.29035.4910161264124875658.stgit@rabbit.intern.cm-ag Signed-off-by: Christian Brauner <brauner@kernel.org> diff a7bc2e8d Thu Jun 01 03:24:00 MDT 2023 chenzhiyin <zhiyin.chen@intel.com> fs.h: Optimize file struct to prevent false sharing In the syscall test of UnixBench, performance regression occurred due to false sharing. The lock and atomic members, including file::f_lock, file::f_count and file::f_pos_lock are highly contended and frequently updated in the high-concurrency test scenarios. perf c2c indentified one affected read access, file::f_op. To prevent false sharing, the layout of file struct is changed as following (A) f_lock, f_count and f_pos_lock are put together to share the same cache line. (B) The read mostly members, including f_path, f_inode, f_op are put into a separate cache line. (C) f_mode is put together with f_count, since they are used frequently at the same time. Due to '__randomize_layout' attribute of file struct, the updated layout only can be effective when CONFIG_RANDSTRUCT_NONE is 'y'. The optimization has been validated in the syscall test of UnixBench. performance gain is 30~50%. Furthermore, to confirm the optimization effectiveness on the other codes path, the results of fsdisk, fsbuffer and fstime are also shown. Here are the detailed test results of unixbench. Command: numactl -C 3-18 ./Run -c 16 syscall fsbuffer fstime fsdisk Without Patch ------------------------------------------------------------------------ File Copy 1024 bufsize 2000 maxblocks 875052.1 KBps (30.0 s, 2 samples) File Copy 256 bufsize 500 maxblocks 235484.0 KBps (30.0 s, 2 samples) File Copy 4096 bufsize 8000 maxblocks 2815153.5 KBps (30.0 s, 2 samples) System Call Overhead 5772268.3 lps (10.0 s, 7 samples) System Benchmarks Partial Index BASELINE RESULT INDEX File Copy 1024 bufsize 2000 maxblocks 3960.0 875052.1 2209.7 File Copy 256 bufsize 500 maxblocks 1655.0 235484.0 1422.9 File Copy 4096 bufsize 8000 maxblocks 5800.0 2815153.5 4853.7 System Call Overhead 15000.0 5772268.3 3848.2 ======== System Benchmarks Index Score (Partial Only) 2768.3 With Patch ------------------------------------------------------------------------ File Copy 1024 bufsize 2000 maxblocks 1009977.2 KBps (30.0 s, 2 samples) File Copy 256 bufsize 500 maxblocks 264765.9 KBps (30.0 s, 2 samples) File Copy 4096 bufsize 8000 maxblocks 3052236.0 KBps (30.0 s, 2 samples) System Call Overhead 8237404.4 lps (10.0 s, 7 samples) System Benchmarks Partial Index BASELINE RESULT INDEX File Copy 1024 bufsize 2000 maxblocks 3960.0 1009977.2 2550.4 File Copy 256 bufsize 500 maxblocks 1655.0 264765.9 1599.8 File Copy 4096 bufsize 8000 maxblocks 5800.0 3052236.0 5262.5 System Call Overhead 15000.0 8237404.4 5491.6 ======== System Benchmarks Index Score (Partial Only) 3295.3 Signed-off-by: chenzhiyin <zhiyin.chen@intel.com> Message-Id: <20230601092400.27162-1-zhiyin.chen@intel.com> Signed-off-by: Christian Brauner <brauner@kernel.org> diff bbcd53c9 Thu May 06 19:05:55 MDT 2021 David Hildenbrand <david@redhat.com> drivers/char: remove /dev/kmem for good Patch series "drivers/char: remove /dev/kmem for good". Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and memory ballooning, I started questioning the existence of /dev/kmem. Comparing it with the /proc/kcore implementation, it does not seem to be able to deal with things like a) Pages unmapped from the direct mapping (e.g., to be used by secretmem) -> kern_addr_valid(). virt_addr_valid() is not sufficient. b) Special cases like gart aperture memory that is not to be touched -> mem_pfn_is_ram() Unless I am missing something, it's at least broken in some cases and might fault/crash the machine. Looks like its existence has been questioned before in 2005 and 2010 [1], after ~11 additional years, it might make sense to revive the discussion. CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by mistake?). All distributions disable it: in Ubuntu it has been disabled for more than 10 years, in Debian since 2.6.31, in Fedora at least starting with FC3, in RHEL starting with RHEL4, in SUSE starting from 15sp2, and OpenSUSE has it disabled as well. 1) /dev/kmem was popular for rootkits [2] before it got disabled basically everywhere. Ubuntu documents [3] "There is no modern user of /dev/kmem any more beyond attackers using it to load kernel rootkits.". RHEL documents in a BZ [5] "it served no practical purpose other than to serve as a potential security problem or to enable binary module drivers to access structures/functions they shouldn't be touching" 2) /proc/kcore is a decent interface to have a controlled way to read kernel memory for debugging puposes. (will need some extensions to deal with memory offlining/unplug, memory ballooning, and poisoned pages, though) 3) It might be useful for corner case debugging [1]. KDB/KGDB might be a better fit, especially, to write random memory; harder to shoot yourself into the foot. 4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems to be incompatible with 64bit [1]. For educational purposes, /proc/kcore might be used to monitor value updates -- or older kernels can be used. 5) It's broken on arm64, and therefore, completely disabled there. Looks like it's essentially unused and has been replaced by better suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's just remove it. [1] https://lwn.net/Articles/147901/ [2] https://www.linuxjournal.com/article/10505 [3] https://wiki.ubuntu.com/Security/Features#A.2Fdev.2Fkmem_disabled [4] https://sourceforge.net/projects/kme/ [5] https://bugzilla.redhat.com/show_bug.cgi?id=154796 Link: https://lkml.kernel.org/r/20210324102351.6932-1-david@redhat.com Link: https://lkml.kernel.org/r/20210324102351.6932-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Alexander A. Klimov" <grandmaster@al2klimov.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brian Cain <bcain@codeaurora.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Corentin Labbe <clabbe@baylibre.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Gregory Clement <gregory.clement@bootlin.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hillf Danton <hdanton@sina.com> Cc: huang ying <huang.ying.caritas@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: James Troup <james.troup@canonical.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <kasong@redhat.com> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Cc: Liviu Dudau <liviu.dudau@arm.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: openrisc@lists.librecores.org Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Paul Mackerras <paulus@samba.org> Cc: "Pavel Machek (CIP)" <pavel@denx.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Robert Richter <rric@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> Cc: sparclinux@vger.kernel.org Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Theodore Dubois <tblodt@icloud.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: William Cohen <wcohen@redhat.com> Cc: Xiaoming Ni <nixiaoming@huawei.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 2773bf00 Tue Sep 27 03:03:58 MDT 2016 Miklos Szeredi <mszeredi@redhat.com> fs: rename "rename2" i_op to "rename" Generated patch: sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2` sed -i "s/\brename2\b/rename/g" `git grep -wl rename2` Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> diff 0b81d077 Fri May 15 17:26:10 MDT 2015 Jaegeuk Kim <jaegeuk@kernel.org> fs crypto: move per-file encryption from f2fs tree to fs/crypto This patch adds the renamed functions moved from the f2fs crypto files. 1. definitions for per-file encryption used by ext4 and f2fs. 2. crypto.c for encrypt/decrypt functions a. IO preparation: - fscrypt_get_ctx / fscrypt_release_ctx b. before IOs: - fscrypt_encrypt_page - fscrypt_decrypt_page - fscrypt_zeroout_range c. after IOs: - fscrypt_decrypt_bio_pages - fscrypt_pullback_bio_page - fscrypt_restore_control_page 3. policy.c supporting context management. a. For ioctls: - fscrypt_process_policy - fscrypt_get_policy b. For context permission - fscrypt_has_permitted_context - fscrypt_inherit_context 4. keyinfo.c to handle permissions - fscrypt_get_encryption_info - fscrypt_free_encryption_info 5. fname.c to support filename encryption a. general wrapper functions - fscrypt_fname_disk_to_usr - fscrypt_fname_usr_to_disk - fscrypt_setup_filename - fscrypt_free_filename b. specific filename handling functions - fscrypt_fname_alloc_buffer - fscrypt_fname_free_buffer 6. Makefile and Kconfig Cc: Al Viro <viro@ftp.linux.org.uk> Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff 0b81d077 Fri May 15 17:26:10 MDT 2015 Jaegeuk Kim <jaegeuk@kernel.org> fs crypto: move per-file encryption from f2fs tree to fs/crypto This patch adds the renamed functions moved from the f2fs crypto files. 1. definitions for per-file encryption used by ext4 and f2fs. 2. crypto.c for encrypt/decrypt functions a. IO preparation: - fscrypt_get_ctx / fscrypt_release_ctx b. before IOs: - fscrypt_encrypt_page - fscrypt_decrypt_page - fscrypt_zeroout_range c. after IOs: - fscrypt_decrypt_bio_pages - fscrypt_pullback_bio_page - fscrypt_restore_control_page 3. policy.c supporting context management. a. For ioctls: - fscrypt_process_policy - fscrypt_get_policy b. For context permission - fscrypt_has_permitted_context - fscrypt_inherit_context 4. keyinfo.c to handle permissions - fscrypt_get_encryption_info - fscrypt_free_encryption_info 5. fname.c to support filename encryption a. general wrapper functions - fscrypt_fname_disk_to_usr - fscrypt_fname_usr_to_disk - fscrypt_setup_filename - fscrypt_free_filename b. specific filename handling functions - fscrypt_fname_alloc_buffer - fscrypt_fname_free_buffer 6. Makefile and Kconfig Cc: Al Viro <viro@ftp.linux.org.uk> Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff 0b81d077 Fri May 15 17:26:10 MDT 2015 Jaegeuk Kim <jaegeuk@kernel.org> fs crypto: move per-file encryption from f2fs tree to fs/crypto This patch adds the renamed functions moved from the f2fs crypto files. 1. definitions for per-file encryption used by ext4 and f2fs. 2. crypto.c for encrypt/decrypt functions a. IO preparation: - fscrypt_get_ctx / fscrypt_release_ctx b. before IOs: - fscrypt_encrypt_page - fscrypt_decrypt_page - fscrypt_zeroout_range c. after IOs: - fscrypt_decrypt_bio_pages - fscrypt_pullback_bio_page - fscrypt_restore_control_page 3. policy.c supporting context management. a. For ioctls: - fscrypt_process_policy - fscrypt_get_policy b. For context permission - fscrypt_has_permitted_context - fscrypt_inherit_context 4. keyinfo.c to handle permissions - fscrypt_get_encryption_info - fscrypt_free_encryption_info 5. fname.c to support filename encryption a. general wrapper functions - fscrypt_fname_disk_to_usr - fscrypt_fname_usr_to_disk - fscrypt_setup_filename - fscrypt_free_filename b. specific filename handling functions - fscrypt_fname_alloc_buffer - fscrypt_fname_free_buffer 6. Makefile and Kconfig Cc: Al Viro <viro@ftp.linux.org.uk> Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> diff 680baacb Sat May 02 11:32:22 MDT 2015 Al Viro <viro@zeniv.linux.org.uk> new ->follow_link() and ->put_link() calling conventions a) instead of storing the symlink body (via nd_set_link()) and returning an opaque pointer later passed to ->put_link(), ->follow_link() _stores_ that opaque pointer (into void * passed by address by caller) and returns the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic symlinks) and pointer to symlink body for normal symlinks. Stored pointer is ignored in all cases except the last one. Storing NULL for opaque pointer (or not storing it at all) means no call of ->put_link(). b) the body used to be passed to ->put_link() implicitly (via nameidata). Now only the opaque pointer is. In the cases when we used the symlink body to free stuff, ->follow_link() now should store it as opaque pointer in addition to returning it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 0ae45f63 Sun Feb 01 22:37:00 MST 2015 Theodore Ts'o <tytso@mit.edu> vfs: add support for a lazytime mount option Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call. For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact long tail latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff e4a0d3e7 Thu Sep 18 09:56:17 MDT 2014 Pavel Emelyanov <xemul@parallels.com> aio: Make it possible to remap aio ring There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> diff e4a0d3e7 Thu Sep 18 09:56:17 MDT 2014 Pavel Emelyanov <xemul@parallels.com> aio: Make it possible to remap aio ring There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> |
/linux-master/mm/ | ||
H A D | filemap.c | diff 0cb8fd4d Thu Jun 08 19:08:20 MDT 2023 Hugh Dickins <hughd@google.com> mm/migrate: remove cruft from migration_entry_wait()s migration_entry_wait_on_locked() does not need to take a mapped pte pointer, its callers can do the unmap first. Annotate it with __releases(ptl) to reduce sparse warnings. Fold __migration_entry_wait_huge() into migration_entry_wait_huge(). Fold __migration_entry_wait() into migration_entry_wait(), preferring the tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr(). Link: https://lkml.kernel.org/r/b0e2a532-cdf2-561b-e999-f3b13b8d6d3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 9144785b Tue May 25 08:11:25 MDT 2021 Matthew Wilcox (Oracle) <willy@infradead.org> filemap: Remove PageHWPoison check from next_uptodate_page() Pages are individually marked as suffering from hardware poisoning. Checking that the head page is not hardware poisoned doesn't make sense; we might be after a subpage. We check each page individually before we use it, so this was an optimisation gone wrong. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> diff 57b2847d Wed Feb 24 01:03:31 MST 2021 Muchun Song <songmuchun@bytedance.com> mm: memcontrol: convert NR_SHMEM_THPS account to pages Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_SHMEM_THPS account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff bf9ecead Wed Feb 24 01:03:27 MST 2021 Muchun Song <songmuchun@bytedance.com> mm: memcontrol: convert NR_FILE_THPS account to pages Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with if hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_FILE_THPS account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 073861ed Tue Nov 24 09:46:43 MST 2020 Hugh Dickins <hughd@google.com> mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback) Twice now, when exercising ext4 looped on shmem huge pages, I have crashed on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling end_page_writeback() calling wake_up_page() on tail of a shmem huge page, no longer an ext4 page at all. The problem is that PageWriteback is not accompanied by a page reference (as the NOTE at the end of test_clear_page_writeback() acknowledges): as soon as TestClearPageWriteback has been done, that page could be removed from page cache, freed, and reused for something else by the time that wake_up_page() is reached. https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/ Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail check; but I'm paranoid about even looking at an unreferenced struct page, lest its memory might itself have already been reused or hotremoved (and wake_up_page_bit() may modify that memory with its ClearPageWaiters()). Then on crashing a second time, realized there's a stronger reason against that approach. If my testing just occasionally crashes on that check, when the page is reused for part of a compound page, wouldn't it be much more common for the page to get reused as an order-0 page before reaching wake_up_page()? And on rare occasions, might that reused page already be marked PageWriteback by its new user, and already be waited upon? What would that look like? It would look like BUG_ON(PageWriteback) after wait_on_page_writeback() in write_cache_pages() (though I have never seen that crash myself). Matthew Wilcox explaining this to himself: "page is allocated, added to page cache, dirtied, writeback starts, --- thread A --- filesystem calls end_page_writeback() test_clear_page_writeback() --- context switch to thread B --- truncate_inode_pages_range() finds the page, it doesn't have writeback set, we delete it from the page cache. Page gets reallocated, dirtied, writeback starts again. Then we call write_cache_pages(), see PageWriteback() set, call wait_on_page_writeback() --- context switch back to thread A --- wake_up_page(page, PG_writeback); ... thread B is woken, but because the wakeup was for the old use of the page, PageWriteback is still set. Devious" And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") this would have been much less likely: before that, wake_page_function()'s non-exclusive case would stop walking and not wake if it found Writeback already set again; whereas now the non-exclusive case proceeds to wake. I have not thought of a fix that does not add a little overhead: the simplest fix is for end_page_writeback() to get_page() before calling test_clear_page_writeback(), then put_page() after wake_up_page(). Was there a chance of missed wakeups before, since a page freed before reaching wake_up_page() would have PageWaiters cleared? I think not, because each waiter does hold a reference on the page. This bug comes when the old use of the page, the one we do TestClearPageWriteback on, had *no* waiters, so no additional page reference beyond the page cache (and whoever racily freed it). The reuse of the page has a waiter holding a reference, and its own PageWriteback set; but the belated wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback). Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com Reported-by: Qian Cai <cai@lca.pw> Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # v5.8+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 073861ed Tue Nov 24 09:46:43 MST 2020 Hugh Dickins <hughd@google.com> mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback) Twice now, when exercising ext4 looped on shmem huge pages, I have crashed on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling end_page_writeback() calling wake_up_page() on tail of a shmem huge page, no longer an ext4 page at all. The problem is that PageWriteback is not accompanied by a page reference (as the NOTE at the end of test_clear_page_writeback() acknowledges): as soon as TestClearPageWriteback has been done, that page could be removed from page cache, freed, and reused for something else by the time that wake_up_page() is reached. https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/ Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail check; but I'm paranoid about even looking at an unreferenced struct page, lest its memory might itself have already been reused or hotremoved (and wake_up_page_bit() may modify that memory with its ClearPageWaiters()). Then on crashing a second time, realized there's a stronger reason against that approach. If my testing just occasionally crashes on that check, when the page is reused for part of a compound page, wouldn't it be much more common for the page to get reused as an order-0 page before reaching wake_up_page()? And on rare occasions, might that reused page already be marked PageWriteback by its new user, and already be waited upon? What would that look like? It would look like BUG_ON(PageWriteback) after wait_on_page_writeback() in write_cache_pages() (though I have never seen that crash myself). Matthew Wilcox explaining this to himself: "page is allocated, added to page cache, dirtied, writeback starts, --- thread A --- filesystem calls end_page_writeback() test_clear_page_writeback() --- context switch to thread B --- truncate_inode_pages_range() finds the page, it doesn't have writeback set, we delete it from the page cache. Page gets reallocated, dirtied, writeback starts again. Then we call write_cache_pages(), see PageWriteback() set, call wait_on_page_writeback() --- context switch back to thread A --- wake_up_page(page, PG_writeback); ... thread B is woken, but because the wakeup was for the old use of the page, PageWriteback is still set. Devious" And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") this would have been much less likely: before that, wake_page_function()'s non-exclusive case would stop walking and not wake if it found Writeback already set again; whereas now the non-exclusive case proceeds to wake. I have not thought of a fix that does not add a little overhead: the simplest fix is for end_page_writeback() to get_page() before calling test_clear_page_writeback(), then put_page() after wake_up_page(). Was there a chance of missed wakeups before, since a page freed before reaching wake_up_page() would have PageWaiters cleared? I think not, because each waiter does hold a reference on the page. This bug comes when the old use of the page, the one we do TestClearPageWriteback on, had *no* waiters, so no additional page reference beyond the page cache (and whoever racily freed it). The reuse of the page has a waiter holding a reference, and its own PageWriteback set; but the belated wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback). Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com Reported-by: Qian Cai <cai@lca.pw> Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # v5.8+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4101196b Mon Sep 23 16:34:52 MDT 2019 Matthew Wilcox (Oracle) <willy@infradead.org> mm: page cache: store only head pages in i_pages Transparent Huge Pages are currently stored in i_pages as pointers to consecutive subpages. This patch changes that to storing consecutive pointers to the head page in preparation for storing huge pages more efficiently in i_pages. Large parts of this are "inspired" by Kirill's patch https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/ Kirill and Huang Ying contributed several fixes. [willy@infradead.org: use compound_nr, squish uninit-var warning] Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Kirill Shutemov <kirill@shutemov.name> Reviewed-by: Song Liu <songliubraving@fb.com> Tested-by: Song Liu <songliubraving@fb.com> Tested-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Tested-by: Qian Cai <cai@lca.pw> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 69bf4b6b Fri Jul 05 20:55:18 MDT 2019 Linus Torvalds <torvalds@linux-foundation.org> Revert "mm: page cache: store only head pages in i_pages" This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f. Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in __delete_from_swap_cache() to trigger: page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363 anon flags: 0x17fffe00080034(uptodate|lru|active|swapbacked) raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689 raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000 page dumped because: VM_BUG_ON_PAGE(entry != page) page->mem_cgroup:ffff978433ace000 ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:170! invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019 RIP: 0010:__delete_from_swap_cache+0x20d/0x240 Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f RSP: 0018:ffffa982036e7980 EFLAGS: 00010046 RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900 RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535 R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120 R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000 FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0 Call Trace: delete_from_swap_cache+0x46/0xa0 try_to_free_swap+0xbc/0x110 swap_writepage+0x13/0x70 pageout.isra.0+0x13c/0x350 shrink_page_list+0xc14/0xdf0 shrink_inactive_list+0x1e5/0x3c0 shrink_node_memcg+0x202/0x760 shrink_node+0xe0/0x470 balance_pgdat+0x2d1/0x510 kswapd+0x220/0x420 kthread+0xfb/0x130 ret_from_fork+0x22/0x40 and it's not immediately obvious why it happens. It's too late in the rc cycle to do anything but revert for now. Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/ Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 69bf4b6b Fri Jul 05 20:55:18 MDT 2019 Linus Torvalds <torvalds@linux-foundation.org> Revert "mm: page cache: store only head pages in i_pages" This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f. Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in __delete_from_swap_cache() to trigger: page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363 anon flags: 0x17fffe00080034(uptodate|lru|active|swapbacked) raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689 raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000 page dumped because: VM_BUG_ON_PAGE(entry != page) page->mem_cgroup:ffff978433ace000 ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:170! invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019 RIP: 0010:__delete_from_swap_cache+0x20d/0x240 Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f RSP: 0018:ffffa982036e7980 EFLAGS: 00010046 RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900 RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535 R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120 R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000 FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0 Call Trace: delete_from_swap_cache+0x46/0xa0 try_to_free_swap+0xbc/0x110 swap_writepage+0x13/0x70 pageout.isra.0+0x13c/0x350 shrink_page_list+0xc14/0xdf0 shrink_inactive_list+0x1e5/0x3c0 shrink_node_memcg+0x202/0x760 shrink_node+0xe0/0x470 balance_pgdat+0x2d1/0x510 kswapd+0x220/0x420 kthread+0xfb/0x130 ret_from_fork+0x22/0x40 and it's not immediately obvious why it happens. It's too late in the rc cycle to do anything but revert for now. Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/ Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 69bf4b6b Fri Jul 05 20:55:18 MDT 2019 Linus Torvalds <torvalds@linux-foundation.org> Revert "mm: page cache: store only head pages in i_pages" This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f. Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in __delete_from_swap_cache() to trigger: page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363 anon flags: 0x17fffe00080034(uptodate|lru|active|swapbacked) raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689 raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000 page dumped because: VM_BUG_ON_PAGE(entry != page) page->mem_cgroup:ffff978433ace000 ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:170! invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019 RIP: 0010:__delete_from_swap_cache+0x20d/0x240 Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f RSP: 0018:ffffa982036e7980 EFLAGS: 00010046 RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900 RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535 R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120 R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000 FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0 Call Trace: delete_from_swap_cache+0x46/0xa0 try_to_free_swap+0xbc/0x110 swap_writepage+0x13/0x70 pageout.isra.0+0x13c/0x350 shrink_page_list+0xc14/0xdf0 shrink_inactive_list+0x1e5/0x3c0 shrink_node_memcg+0x202/0x760 shrink_node+0xe0/0x470 balance_pgdat+0x2d1/0x510 kswapd+0x220/0x420 kthread+0xfb/0x130 ret_from_fork+0x22/0x40 and it's not immediately obvious why it happens. It's too late in the rc cycle to do anything but revert for now. Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/ Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 69bf4b6b Fri Jul 05 20:55:18 MDT 2019 Linus Torvalds <torvalds@linux-foundation.org> Revert "mm: page cache: store only head pages in i_pages" This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f. Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in __delete_from_swap_cache() to trigger: page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363 anon flags: 0x17fffe00080034(uptodate|lru|active|swapbacked) raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689 raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000 page dumped because: VM_BUG_ON_PAGE(entry != page) page->mem_cgroup:ffff978433ace000 ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:170! invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019 RIP: 0010:__delete_from_swap_cache+0x20d/0x240 Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f RSP: 0018:ffffa982036e7980 EFLAGS: 00010046 RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900 RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535 R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120 R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000 FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0 Call Trace: delete_from_swap_cache+0x46/0xa0 try_to_free_swap+0xbc/0x110 swap_writepage+0x13/0x70 pageout.isra.0+0x13c/0x350 shrink_page_list+0xc14/0xdf0 shrink_inactive_list+0x1e5/0x3c0 shrink_node_memcg+0x202/0x760 shrink_node+0xe0/0x470 balance_pgdat+0x2d1/0x510 kswapd+0x220/0x420 kthread+0xfb/0x130 ret_from_fork+0x22/0x40 and it's not immediately obvious why it happens. It's too late in the rc cycle to do anything but revert for now. Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/ Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 69bf4b6b Fri Jul 05 20:55:18 MDT 2019 Linus Torvalds <torvalds@linux-foundation.org> Revert "mm: page cache: store only head pages in i_pages" This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f. Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in __delete_from_swap_cache() to trigger: page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363 anon flags: 0x17fffe00080034(uptodate|lru|active|swapbacked) raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689 raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000 page dumped because: VM_BUG_ON_PAGE(entry != page) page->mem_cgroup:ffff978433ace000 ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:170! invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019 RIP: 0010:__delete_from_swap_cache+0x20d/0x240 Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f RSP: 0018:ffffa982036e7980 EFLAGS: 00010046 RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900 RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535 R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120 R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000 FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0 Call Trace: delete_from_swap_cache+0x46/0xa0 try_to_free_swap+0xbc/0x110 swap_writepage+0x13/0x70 pageout.isra.0+0x13c/0x350 shrink_page_list+0xc14/0xdf0 shrink_inactive_list+0x1e5/0x3c0 shrink_node_memcg+0x202/0x760 shrink_node+0xe0/0x470 balance_pgdat+0x2d1/0x510 kswapd+0x220/0x420 kthread+0xfb/0x130 ret_from_fork+0x22/0x40 and it's not immediately obvious why it happens. It's too late in the rc cycle to do anything but revert for now. Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/ Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
H A D | shmem.c | diff 3022fd7a Fri Sep 29 21:32:40 MDT 2023 Hugh Dickins <hughd@google.com> shmem: _add_to_page_cache() before shmem_inode_acct_blocks() There has been a recurring problem, that when a tmpfs volume is being filled by racing threads, some fail with ENOSPC (or consequent SIGBUS or EFAULT) even though all allocations were within the permitted size. This was a problem since early days, but magnified and complicated by the addition of huge pages. We have often worked around it by adding some slop to the tmpfs size, but it's hard to say how much is needed, and some users prefer not to do that e.g. keeping sparse files in a tightly tailored tmpfs helps to prevent accidental writing to holes. This comes from the allocation sequence: 1. check page cache for existing folio 2. check and reserve from vm_enough_memory 3. check and account from size of tmpfs 4. if huge, check page cache for overlapping folio 5. allocate physical folio, huge or small 6. check and charge from mem cgroup limit 7. add to page cache (but maybe another folio already got in). Concurrent tasks allocating at the same position could deplete the size allowance and fail. Doing vm_enough_memory and size checks before the folio allocation was intentional (to limit the load on the page allocator from this source) and still has some virtue; but memory cgroup never did that, so I think it's better reordered to favour predictable behaviour. 1. check page cache for existing folio 2. if huge, check page cache for overlapping folio 3. allocate physical folio, huge or small 4. check and charge from mem cgroup limit 5. add to page cache (but maybe another folio already got in) 6. check and reserve from vm_enough_memory 7. check and account from size of tmpfs. The folio lock held from allocation onwards ensures that the !uptodate folio cannot be used by others, and can safely be deleted from the cache if checks 6 or 7 subsequently fail (and those waiting on folio lock already check that the folio was not truncated once they get the lock); and the early addition to page cache ensures that racers find it before they try to duplicate the accounting. Seize the opportunity to tidy up shmem_get_folio_gfp()'s ENOSPC retrying, which can be combined inside the new shmem_alloc_and_add_folio(): doing 2 splits twice (once huge, once nonhuge) is not exactly equivalent to trying 5 splits (and giving up early on huge), but let's keep it simple unless more complication proves necessary. Userfaultfd is a foreign country: they do things differently there, and for good reason - to avoid mmap_lock deadlock. Leave ordering in shmem_mfill_atomic_pte() untouched for now, but I would rather like to mesh it better with shmem_get_folio_gfp() in the future. Link: https://lkml.kernel.org/r/22ddd06-d919-33b-1219-56335c1bf28e@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Carlos Maiolino <cem@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 1f514bee Thu Mar 09 16:05:40 MST 2023 Luis Chamberlain <mcgrof@kernel.org> shmem: remove check for folio lock on writepage() Patch series "tmpfs: add the option to disable swap", v2. I'm doing this work as part of future experimentation with tmpfs and the page cache, but given a common complaint found about tmpfs is the innability to work without the page cache I figured this might be useful to others. It turns out it is -- at least Christian Brauner indicates systemd uses ramfs for a few use-cases because they don't want to use swap and so having this option would let them move over to using tmpfs for those small use cases, see systemd-creds(1). To see if you hit swap: mkswap /dev/nvme2n1 swapon /dev/nvme2n1 free -h With swap - what we see today ============================= mount -t tmpfs -o size=5G tmpfs /data-tmpfs/ dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5 free -h total used free shared buff/cache available Mem: 3.7Gi 2.6Gi 1.2Gi 2.2Gi 2.2Gi 1.2Gi Swap: 99Gi 2.8Gi 97Gi Without swap ============= free -h total used free shared buff/cache available Mem: 3.7Gi 387Mi 3.4Gi 2.1Mi 57Mi 3.3Gi Swap: 99Gi 0B 99Gi mount -t tmpfs -o size=5G -o noswap tmpfs /data-tmpfs/ dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5 free -h total used free shared buff/cache available Mem: 3.7Gi 2.6Gi 1.2Gi 2.3Gi 2.3Gi 1.1Gi Swap: 99Gi 21Mi 99Gi The mix and match remount testing ================================= # Cannot disable swap after it was first enabled: mount -t tmpfs -o size=5G tmpfs /data-tmpfs/ mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/ mount: /data-tmpfs: mount point not mounted or bad option. dmesg(1) may have more information after failed mount system call. dmesg -c tmpfs: Cannot disable swap on remount # Remount with the same noswap option is OK: mount -t tmpfs -o size=5G -o noswap tmpfs /data-tmpfs/ mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/ dmesg -c # Trying to enable swap with a remount after it first disabled: mount -t tmpfs -o size=5G -o noswap tmpfs /data-tmpfs/ mount -t tmpfs -o remount -o size=5G tmpfs /data-tmpfs/ mount: /data-tmpfs: mount point not mounted or bad option. dmesg(1) may have more information after failed mount system call. dmesg -c tmpfs: Cannot enable swap on remount if it was disabled on first mount This patch (of 6): Matthew notes we should not need to check the folio lock on the writepage() callback so remove it. This sanity check has been lingering since linux-history days. We remove this as we tidy up the writepage() callback to make things a bit clearer. Link: https://lkml.kernel.org/r/20230309230545.2930737-1-mcgrof@kernel.org Link: https://lkml.kernel.org/r/20230309230545.2930737-2-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Tested-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 5dc21f0c Tue Oct 25 16:01:08 MDT 2022 Ira Weiny <ira.weiny@intel.com> mm/shmem: ensure proper fallback if page faults The kernel test robot flagged a recursive lock as a result of a conversion from kmap_atomic() to kmap_local_folio()[Link] The cause was due to the code depending on the kmap_atomic() side effect of disabling page faults. In that case the code expects the fault to fail and take the fallback case. git archaeology implied that the recursion may not be an actual bug.[1] However, depending on the implementation of the mmap_lock and the condition of the call there may still be a deadlock.[2] So this is not purely a lockdep issue. Considering a single threaded call stack there are 3 options. 1) Different mm's are in play (no issue) 2) Readlock implementation is recursive and same mm is in play (no issue) 3) Readlock implementation is _not_ recursive (issue) The mmap_lock is recursive so with a single thread there is no issue. However, Matthew pointed out a deadlock scenario when you consider additional process' and threads thusly. "The readlock implementation is only recursive if nobody else has taken a write lock. If you have a multithreaded process, one of the other threads can call mmap() and that will prevent recursion (due to fairness). Even if it's a different process that you're trying to acquire the mmap read lock on, you can still get into a deadly embrace. eg: process A thread 1 takes read lock on own mmap_lock process A thread 2 calls mmap, blocks taking write lock process B thread 1 takes page fault, read lock on own mmap lock process B thread 2 calls mmap, blocks taking write lock process A thread 1 blocks taking read lock on process B process B thread 1 blocks taking read lock on process A Now all four threads are blocked waiting for each other." Regardless using pagefault_disable() ensures that no matter what locking implementation is used a deadlock will not occur. Add an explicit pagefault_disable() and a big comment to explain this for future souls looking at this code. [1] https://lore.kernel.org/all/Y1MymJ%2FINb45AdaY@iweiny-desk3/ [2] https://lore.kernel.org/lkml/Y1bXBtGTCym77%2FoD@casper.infradead.org/ Link: https://lkml.kernel.org/r/20221025220108.2366043-1-ira.weiny@intel.com Link: https://lore.kernel.org/r/202210211215.9dc6efb5-yujie.liu@intel.com Fixes: 7a7256d5f512 ("shmem: convert shmem_mfill_atomic_pte() to use a folio") Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: kernel test robot <yujie.liu@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 5dc21f0c Tue Oct 25 16:01:08 MDT 2022 Ira Weiny <ira.weiny@intel.com> mm/shmem: ensure proper fallback if page faults The kernel test robot flagged a recursive lock as a result of a conversion from kmap_atomic() to kmap_local_folio()[Link] The cause was due to the code depending on the kmap_atomic() side effect of disabling page faults. In that case the code expects the fault to fail and take the fallback case. git archaeology implied that the recursion may not be an actual bug.[1] However, depending on the implementation of the mmap_lock and the condition of the call there may still be a deadlock.[2] So this is not purely a lockdep issue. Considering a single threaded call stack there are 3 options. 1) Different mm's are in play (no issue) 2) Readlock implementation is recursive and same mm is in play (no issue) 3) Readlock implementation is _not_ recursive (issue) The mmap_lock is recursive so with a single thread there is no issue. However, Matthew pointed out a deadlock scenario when you consider additional process' and threads thusly. "The readlock implementation is only recursive if nobody else has taken a write lock. If you have a multithreaded process, one of the other threads can call mmap() and that will prevent recursion (due to fairness). Even if it's a different process that you're trying to acquire the mmap read lock on, you can still get into a deadly embrace. eg: process A thread 1 takes read lock on own mmap_lock process A thread 2 calls mmap, blocks taking write lock process B thread 1 takes page fault, read lock on own mmap lock process B thread 2 calls mmap, blocks taking write lock process A thread 1 blocks taking read lock on process B process B thread 1 blocks taking read lock on process A Now all four threads are blocked waiting for each other." Regardless using pagefault_disable() ensures that no matter what locking implementation is used a deadlock will not occur. Add an explicit pagefault_disable() and a big comment to explain this for future souls looking at this code. [1] https://lore.kernel.org/all/Y1MymJ%2FINb45AdaY@iweiny-desk3/ [2] https://lore.kernel.org/lkml/Y1bXBtGTCym77%2FoD@casper.infradead.org/ Link: https://lkml.kernel.org/r/20221025220108.2366043-1-ira.weiny@intel.com Link: https://lore.kernel.org/r/202210211215.9dc6efb5-yujie.liu@intel.com Fixes: 7a7256d5f512 ("shmem: convert shmem_mfill_atomic_pte() to use a folio") Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: kernel test robot <yujie.liu@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 5dc21f0c Tue Oct 25 16:01:08 MDT 2022 Ira Weiny <ira.weiny@intel.com> mm/shmem: ensure proper fallback if page faults The kernel test robot flagged a recursive lock as a result of a conversion from kmap_atomic() to kmap_local_folio()[Link] The cause was due to the code depending on the kmap_atomic() side effect of disabling page faults. In that case the code expects the fault to fail and take the fallback case. git archaeology implied that the recursion may not be an actual bug.[1] However, depending on the implementation of the mmap_lock and the condition of the call there may still be a deadlock.[2] So this is not purely a lockdep issue. Considering a single threaded call stack there are 3 options. 1) Different mm's are in play (no issue) 2) Readlock implementation is recursive and same mm is in play (no issue) 3) Readlock implementation is _not_ recursive (issue) The mmap_lock is recursive so with a single thread there is no issue. However, Matthew pointed out a deadlock scenario when you consider additional process' and threads thusly. "The readlock implementation is only recursive if nobody else has taken a write lock. If you have a multithreaded process, one of the other threads can call mmap() and that will prevent recursion (due to fairness). Even if it's a different process that you're trying to acquire the mmap read lock on, you can still get into a deadly embrace. eg: process A thread 1 takes read lock on own mmap_lock process A thread 2 calls mmap, blocks taking write lock process B thread 1 takes page fault, read lock on own mmap lock process B thread 2 calls mmap, blocks taking write lock process A thread 1 blocks taking read lock on process B process B thread 1 blocks taking read lock on process A Now all four threads are blocked waiting for each other." Regardless using pagefault_disable() ensures that no matter what locking implementation is used a deadlock will not occur. Add an explicit pagefault_disable() and a big comment to explain this for future souls looking at this code. [1] https://lore.kernel.org/all/Y1MymJ%2FINb45AdaY@iweiny-desk3/ [2] https://lore.kernel.org/lkml/Y1bXBtGTCym77%2FoD@casper.infradead.org/ Link: https://lkml.kernel.org/r/20221025220108.2366043-1-ira.weiny@intel.com Link: https://lore.kernel.org/r/202210211215.9dc6efb5-yujie.liu@intel.com Fixes: 7a7256d5f512 ("shmem: convert shmem_mfill_atomic_pte() to use a folio") Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: kernel test robot <yujie.liu@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 5dc21f0c Tue Oct 25 16:01:08 MDT 2022 Ira Weiny <ira.weiny@intel.com> mm/shmem: ensure proper fallback if page faults The kernel test robot flagged a recursive lock as a result of a conversion from kmap_atomic() to kmap_local_folio()[Link] The cause was due to the code depending on the kmap_atomic() side effect of disabling page faults. In that case the code expects the fault to fail and take the fallback case. git archaeology implied that the recursion may not be an actual bug.[1] However, depending on the implementation of the mmap_lock and the condition of the call there may still be a deadlock.[2] So this is not purely a lockdep issue. Considering a single threaded call stack there are 3 options. 1) Different mm's are in play (no issue) 2) Readlock implementation is recursive and same mm is in play (no issue) 3) Readlock implementation is _not_ recursive (issue) The mmap_lock is recursive so with a single thread there is no issue. However, Matthew pointed out a deadlock scenario when you consider additional process' and threads thusly. "The readlock implementation is only recursive if nobody else has taken a write lock. If you have a multithreaded process, one of the other threads can call mmap() and that will prevent recursion (due to fairness). Even if it's a different process that you're trying to acquire the mmap read lock on, you can still get into a deadly embrace. eg: process A thread 1 takes read lock on own mmap_lock process A thread 2 calls mmap, blocks taking write lock process B thread 1 takes page fault, read lock on own mmap lock process B thread 2 calls mmap, blocks taking write lock process A thread 1 blocks taking read lock on process B process B thread 1 blocks taking read lock on process A Now all four threads are blocked waiting for each other." Regardless using pagefault_disable() ensures that no matter what locking implementation is used a deadlock will not occur. Add an explicit pagefault_disable() and a big comment to explain this for future souls looking at this code. [1] https://lore.kernel.org/all/Y1MymJ%2FINb45AdaY@iweiny-desk3/ [2] https://lore.kernel.org/lkml/Y1bXBtGTCym77%2FoD@casper.infradead.org/ Link: https://lkml.kernel.org/r/20221025220108.2366043-1-ira.weiny@intel.com Link: https://lore.kernel.org/r/202210211215.9dc6efb5-yujie.liu@intel.com Fixes: 7a7256d5f512 ("shmem: convert shmem_mfill_atomic_pte() to use a folio") Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: kernel test robot <yujie.liu@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff d14f5efa Thu May 12 20:52:25 MDT 2022 Luo Meng <luomeng12@huawei.com> tmpfs: fix undefined-behaviour in shmem_reconfigure() When shmem_reconfigure() calls __percpu_counter_compare(), the second parameter is unsigned long long. But in the definition of __percpu_counter_compare(), the second parameter is s64. So when __percpu_counter_compare() executes abs(count - rhs), UBSAN shows the following warning: ================================================================================ UBSAN: Undefined behaviour in lib/percpu_counter.c:209:6 signed integer overflow: 0 - -9223372036854775808 cannot be represented in type 'long long int' CPU: 1 PID: 9636 Comm: syz-executor.2 Tainted: G ---------r- - 4.18.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack home/install/linux-rh-3-10/lib/dump_stack.c:77 [inline] dump_stack+0x125/0x1ae home/install/linux-rh-3-10/lib/dump_stack.c:117 ubsan_epilogue+0xe/0x81 home/install/linux-rh-3-10/lib/ubsan.c:159 handle_overflow+0x19d/0x1ec home/install/linux-rh-3-10/lib/ubsan.c:190 __percpu_counter_compare+0x124/0x140 home/install/linux-rh-3-10/lib/percpu_counter.c:209 percpu_counter_compare home/install/linux-rh-3-10/./include/linux/percpu_counter.h:50 [inline] shmem_remount_fs+0x1ce/0x6b0 home/install/linux-rh-3-10/mm/shmem.c:3530 do_remount_sb+0x11b/0x530 home/install/linux-rh-3-10/fs/super.c:888 do_remount home/install/linux-rh-3-10/fs/namespace.c:2344 [inline] do_mount+0xf8d/0x26b0 home/install/linux-rh-3-10/fs/namespace.c:2844 ksys_mount+0xad/0x120 home/install/linux-rh-3-10/fs/namespace.c:3075 __do_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3089 [inline] __se_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3086 [inline] __x64_sys_mount+0xbf/0x160 home/install/linux-rh-3-10/fs/namespace.c:3086 do_syscall_64+0xca/0x5c0 home/install/linux-rh-3-10/arch/x86/entry/common.c:298 entry_SYSCALL_64_after_hwframe+0x6a/0xdf RIP: 0033:0x46b5e9 Code: 5d db fa ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b db fa ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f54d5f22c68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 000000000077bf60 RCX: 000000000046b5e9 RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000000 RBP: 000000000077bf60 R08: 0000000020000140 R09: 0000000000000000 R10: 00000000026740a4 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd1fb1592f R14: 00007f54d5f239c0 R15: 000000000077bf6c ================================================================================ [akpm@linux-foundation.org: tweak error message text] Link: https://lkml.kernel.org/r/20220513025225.2678727-1-luomeng12@huawei.com Signed-off-by: Luo Meng <luomeng12@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff d14f5efa Thu May 12 20:52:25 MDT 2022 Luo Meng <luomeng12@huawei.com> tmpfs: fix undefined-behaviour in shmem_reconfigure() When shmem_reconfigure() calls __percpu_counter_compare(), the second parameter is unsigned long long. But in the definition of __percpu_counter_compare(), the second parameter is s64. So when __percpu_counter_compare() executes abs(count - rhs), UBSAN shows the following warning: ================================================================================ UBSAN: Undefined behaviour in lib/percpu_counter.c:209:6 signed integer overflow: 0 - -9223372036854775808 cannot be represented in type 'long long int' CPU: 1 PID: 9636 Comm: syz-executor.2 Tainted: G ---------r- - 4.18.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack home/install/linux-rh-3-10/lib/dump_stack.c:77 [inline] dump_stack+0x125/0x1ae home/install/linux-rh-3-10/lib/dump_stack.c:117 ubsan_epilogue+0xe/0x81 home/install/linux-rh-3-10/lib/ubsan.c:159 handle_overflow+0x19d/0x1ec home/install/linux-rh-3-10/lib/ubsan.c:190 __percpu_counter_compare+0x124/0x140 home/install/linux-rh-3-10/lib/percpu_counter.c:209 percpu_counter_compare home/install/linux-rh-3-10/./include/linux/percpu_counter.h:50 [inline] shmem_remount_fs+0x1ce/0x6b0 home/install/linux-rh-3-10/mm/shmem.c:3530 do_remount_sb+0x11b/0x530 home/install/linux-rh-3-10/fs/super.c:888 do_remount home/install/linux-rh-3-10/fs/namespace.c:2344 [inline] do_mount+0xf8d/0x26b0 home/install/linux-rh-3-10/fs/namespace.c:2844 ksys_mount+0xad/0x120 home/install/linux-rh-3-10/fs/namespace.c:3075 __do_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3089 [inline] __se_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3086 [inline] __x64_sys_mount+0xbf/0x160 home/install/linux-rh-3-10/fs/namespace.c:3086 do_syscall_64+0xca/0x5c0 home/install/linux-rh-3-10/arch/x86/entry/common.c:298 entry_SYSCALL_64_after_hwframe+0x6a/0xdf RIP: 0033:0x46b5e9 Code: 5d db fa ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b db fa ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f54d5f22c68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 000000000077bf60 RCX: 000000000046b5e9 RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000000 RBP: 000000000077bf60 R08: 0000000020000140 R09: 0000000000000000 R10: 00000000026740a4 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd1fb1592f R14: 00007f54d5f239c0 R15: 000000000077bf6c ================================================================================ [akpm@linux-foundation.org: tweak error message text] Link: https://lkml.kernel.org/r/20220513025225.2678727-1-luomeng12@huawei.com Signed-off-by: Luo Meng <luomeng12@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff d14f5efa Thu May 12 20:52:25 MDT 2022 Luo Meng <luomeng12@huawei.com> tmpfs: fix undefined-behaviour in shmem_reconfigure() When shmem_reconfigure() calls __percpu_counter_compare(), the second parameter is unsigned long long. But in the definition of __percpu_counter_compare(), the second parameter is s64. So when __percpu_counter_compare() executes abs(count - rhs), UBSAN shows the following warning: ================================================================================ UBSAN: Undefined behaviour in lib/percpu_counter.c:209:6 signed integer overflow: 0 - -9223372036854775808 cannot be represented in type 'long long int' CPU: 1 PID: 9636 Comm: syz-executor.2 Tainted: G ---------r- - 4.18.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack home/install/linux-rh-3-10/lib/dump_stack.c:77 [inline] dump_stack+0x125/0x1ae home/install/linux-rh-3-10/lib/dump_stack.c:117 ubsan_epilogue+0xe/0x81 home/install/linux-rh-3-10/lib/ubsan.c:159 handle_overflow+0x19d/0x1ec home/install/linux-rh-3-10/lib/ubsan.c:190 __percpu_counter_compare+0x124/0x140 home/install/linux-rh-3-10/lib/percpu_counter.c:209 percpu_counter_compare home/install/linux-rh-3-10/./include/linux/percpu_counter.h:50 [inline] shmem_remount_fs+0x1ce/0x6b0 home/install/linux-rh-3-10/mm/shmem.c:3530 do_remount_sb+0x11b/0x530 home/install/linux-rh-3-10/fs/super.c:888 do_remount home/install/linux-rh-3-10/fs/namespace.c:2344 [inline] do_mount+0xf8d/0x26b0 home/install/linux-rh-3-10/fs/namespace.c:2844 ksys_mount+0xad/0x120 home/install/linux-rh-3-10/fs/namespace.c:3075 __do_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3089 [inline] __se_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3086 [inline] __x64_sys_mount+0xbf/0x160 home/install/linux-rh-3-10/fs/namespace.c:3086 do_syscall_64+0xca/0x5c0 home/install/linux-rh-3-10/arch/x86/entry/common.c:298 entry_SYSCALL_64_after_hwframe+0x6a/0xdf RIP: 0033:0x46b5e9 Code: 5d db fa ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b db fa ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f54d5f22c68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 000000000077bf60 RCX: 000000000046b5e9 RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000000 RBP: 000000000077bf60 R08: 0000000020000140 R09: 0000000000000000 R10: 00000000026740a4 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd1fb1592f R14: 00007f54d5f239c0 R15: 000000000077bf6c ================================================================================ [akpm@linux-foundation.org: tweak error message text] Link: https://lkml.kernel.org/r/20220513025225.2678727-1-luomeng12@huawei.com Signed-off-by: Luo Meng <luomeng12@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 62c9827c Fri Jan 14 15:05:23 MST 2022 Gang Li <ligang.bdlg@bytedance.com> shmem: fix a race between shmem_unused_huge_shrink and shmem_evict_inode Fix a data race in commit 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure"). Here are call traces causing race: Call Trace 1: shmem_unused_huge_shrink+0x3ae/0x410 ? __list_lru_walk_one.isra.5+0x33/0x160 super_cache_scan+0x17c/0x190 shrink_slab.part.55+0x1ef/0x3f0 shrink_node+0x10e/0x330 kswapd+0x380/0x740 kthread+0xfc/0x130 ? mem_cgroup_shrink_node+0x170/0x170 ? kthread_create_on_node+0x70/0x70 ret_from_fork+0x1f/0x30 Call Trace 2: shmem_evict_inode+0xd8/0x190 evict+0xbe/0x1c0 do_unlinkat+0x137/0x330 do_syscall_64+0x76/0x120 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 A simple explanation: Image there are 3 items in the local list (@list). In the first traversal, A is not deleted from @list. 1) A->B->C ^ | pos (leave) In the second traversal, B is deleted from @list. Concurrently, A is deleted from @list through shmem_evict_inode() since last reference counter of inode is dropped by other thread. Then the @list is corrupted. 2) A->B->C ^ ^ | | evict pos (drop) We should make sure the inode is either on the global list or deleted from any local list before iput(). Fixed by moving inodes back to global list before we put them. [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20211125064502.99983-1-ligang.bdlg@bytedance.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 62c9827c Fri Jan 14 15:05:23 MST 2022 Gang Li <ligang.bdlg@bytedance.com> shmem: fix a race between shmem_unused_huge_shrink and shmem_evict_inode Fix a data race in commit 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure"). Here are call traces causing race: Call Trace 1: shmem_unused_huge_shrink+0x3ae/0x410 ? __list_lru_walk_one.isra.5+0x33/0x160 super_cache_scan+0x17c/0x190 shrink_slab.part.55+0x1ef/0x3f0 shrink_node+0x10e/0x330 kswapd+0x380/0x740 kthread+0xfc/0x130 ? mem_cgroup_shrink_node+0x170/0x170 ? kthread_create_on_node+0x70/0x70 ret_from_fork+0x1f/0x30 Call Trace 2: shmem_evict_inode+0xd8/0x190 evict+0xbe/0x1c0 do_unlinkat+0x137/0x330 do_syscall_64+0x76/0x120 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 A simple explanation: Image there are 3 items in the local list (@list). In the first traversal, A is not deleted from @list. 1) A->B->C ^ | pos (leave) In the second traversal, B is deleted from @list. Concurrently, A is deleted from @list through shmem_evict_inode() since last reference counter of inode is dropped by other thread. Then the @list is corrupted. 2) A->B->C ^ ^ | | evict pos (drop) We should make sure the inode is either on the global list or deleted from any local list before iput(). Fixed by moving inodes back to global list before we put them. [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20211125064502.99983-1-ligang.bdlg@bytedance.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 62c9827c Fri Jan 14 15:05:23 MST 2022 Gang Li <ligang.bdlg@bytedance.com> shmem: fix a race between shmem_unused_huge_shrink and shmem_evict_inode Fix a data race in commit 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure"). Here are call traces causing race: Call Trace 1: shmem_unused_huge_shrink+0x3ae/0x410 ? __list_lru_walk_one.isra.5+0x33/0x160 super_cache_scan+0x17c/0x190 shrink_slab.part.55+0x1ef/0x3f0 shrink_node+0x10e/0x330 kswapd+0x380/0x740 kthread+0xfc/0x130 ? mem_cgroup_shrink_node+0x170/0x170 ? kthread_create_on_node+0x70/0x70 ret_from_fork+0x1f/0x30 Call Trace 2: shmem_evict_inode+0xd8/0x190 evict+0xbe/0x1c0 do_unlinkat+0x137/0x330 do_syscall_64+0x76/0x120 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 A simple explanation: Image there are 3 items in the local list (@list). In the first traversal, A is not deleted from @list. 1) A->B->C ^ | pos (leave) In the second traversal, B is deleted from @list. Concurrently, A is deleted from @list through shmem_evict_inode() since last reference counter of inode is dropped by other thread. Then the @list is corrupted. 2) A->B->C ^ ^ | | evict pos (drop) We should make sure the inode is either on the global list or deleted from any local list before iput(). Fixed by moving inodes back to global list before we put them. [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20211125064502.99983-1-ligang.bdlg@bytedance.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
Completed in 1602 milliseconds