Searched +hist:1025774 +hist:c (Results 1 - 25 of 26) sorted by relevance
/linux-master/fs/omfs/ | ||
H A D | file.c | diff 4d8cbf6d Fri Jul 14 13:56:37 MDT 2023 Gustavo A. R. Silva <gustavoars@kernel.org> fs: omfs: Use flexible-array member in struct omfs_extent Memory for 'struct omfs_extent' and a 'e_extent_count' number of extent entries is indirectly allocated through 'bh->b_data', which is a pointer to data within the page. This implies that the member 'e_entry' (which is the start of extent entries) functions more like an array than a single object of type 'struct omfs_extent_entry'. So we better turn this object into a proper array, in this case a flexible-array member, and with that, fix the following -Wstringop-overflow warning seen after building s390 architecture with allyesconfig (GCC 13): fs/omfs/file.c: In function 'omfs_grow_extent': include/linux/fortify-string.h:57:33: warning: writing 16 bytes into a region of size 0 [-Wstringop-overflow=] 57 | #define __underlying_memcpy __builtin_memcpy | ^ include/linux/fortify-string.h:648:9: note: in expansion of macro '__underlying_memcpy' 648 | __underlying_##op(p, q, __fortify_size); \ | ^~~~~~~~~~~~~ include/linux/fortify-string.h:693:26: note: in expansion of macro '__fortify_memcpy_chk' 693 | #define memcpy(p, q, s) __fortify_memcpy_chk(p, q, s, \ | ^~~~~~~~~~~~~~~~~~~~ fs/omfs/file.c:170:9: note: in expansion of macro 'memcpy' 170 | memcpy(terminator, entry, sizeof(struct omfs_extent_entry)); | ^~~~~~ In file included from fs/omfs/omfs.h:8, from fs/omfs/file.c:11: fs/omfs/omfs_fs.h:80:34: note: at offset 16 into destination object 'e_entry' of size 16 80 | struct omfs_extent_entry e_entry; /* start of extent entries */ | ^~~~~~~ There are some binary differences before and after changes, but this are expected due to the change in the size of 'struct omfs_extent' and the necessary adjusments. This helps with the ongoing efforts to globally enable -Wstringop-overflow. Link: https://github.com/KSPP/linux/issues/330 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> diff 4d8cbf6d Fri Jul 14 13:56:37 MDT 2023 Gustavo A. R. Silva <gustavoars@kernel.org> fs: omfs: Use flexible-array member in struct omfs_extent Memory for 'struct omfs_extent' and a 'e_extent_count' number of extent entries is indirectly allocated through 'bh->b_data', which is a pointer to data within the page. This implies that the member 'e_entry' (which is the start of extent entries) functions more like an array than a single object of type 'struct omfs_extent_entry'. So we better turn this object into a proper array, in this case a flexible-array member, and with that, fix the following -Wstringop-overflow warning seen after building s390 architecture with allyesconfig (GCC 13): fs/omfs/file.c: In function 'omfs_grow_extent': include/linux/fortify-string.h:57:33: warning: writing 16 bytes into a region of size 0 [-Wstringop-overflow=] 57 | #define __underlying_memcpy __builtin_memcpy | ^ include/linux/fortify-string.h:648:9: note: in expansion of macro '__underlying_memcpy' 648 | __underlying_##op(p, q, __fortify_size); \ | ^~~~~~~~~~~~~ include/linux/fortify-string.h:693:26: note: in expansion of macro '__fortify_memcpy_chk' 693 | #define memcpy(p, q, s) __fortify_memcpy_chk(p, q, s, \ | ^~~~~~~~~~~~~~~~~~~~ fs/omfs/file.c:170:9: note: in expansion of macro 'memcpy' 170 | memcpy(terminator, entry, sizeof(struct omfs_extent_entry)); | ^~~~~~ In file included from fs/omfs/omfs.h:8, from fs/omfs/file.c:11: fs/omfs/omfs_fs.h:80:34: note: at offset 16 into destination object 'e_entry' of size 16 80 | struct omfs_extent_entry e_entry; /* start of extent entries */ | ^~~~~~~ There are some binary differences before and after changes, but this are expected due to the change in the size of 'struct omfs_extent' and the necessary adjusments. This helps with the ongoing efforts to globally enable -Wstringop-overflow. Link: https://github.com/KSPP/linux/issues/330 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> diff 4d8cbf6d Fri Jul 14 13:56:37 MDT 2023 Gustavo A. R. Silva <gustavoars@kernel.org> fs: omfs: Use flexible-array member in struct omfs_extent Memory for 'struct omfs_extent' and a 'e_extent_count' number of extent entries is indirectly allocated through 'bh->b_data', which is a pointer to data within the page. This implies that the member 'e_entry' (which is the start of extent entries) functions more like an array than a single object of type 'struct omfs_extent_entry'. So we better turn this object into a proper array, in this case a flexible-array member, and with that, fix the following -Wstringop-overflow warning seen after building s390 architecture with allyesconfig (GCC 13): fs/omfs/file.c: In function 'omfs_grow_extent': include/linux/fortify-string.h:57:33: warning: writing 16 bytes into a region of size 0 [-Wstringop-overflow=] 57 | #define __underlying_memcpy __builtin_memcpy | ^ include/linux/fortify-string.h:648:9: note: in expansion of macro '__underlying_memcpy' 648 | __underlying_##op(p, q, __fortify_size); \ | ^~~~~~~~~~~~~ include/linux/fortify-string.h:693:26: note: in expansion of macro '__fortify_memcpy_chk' 693 | #define memcpy(p, q, s) __fortify_memcpy_chk(p, q, s, \ | ^~~~~~~~~~~~~~~~~~~~ fs/omfs/file.c:170:9: note: in expansion of macro 'memcpy' 170 | memcpy(terminator, entry, sizeof(struct omfs_extent_entry)); | ^~~~~~ In file included from fs/omfs/omfs.h:8, from fs/omfs/file.c:11: fs/omfs/omfs_fs.h:80:34: note: at offset 16 into destination object 'e_entry' of size 16 80 | struct omfs_extent_entry e_entry; /* start of extent entries */ | ^~~~~~~ There are some binary differences before and after changes, but this are expected due to the change in the size of 'struct omfs_extent' and the necessary adjusments. This helps with the ongoing efforts to globally enable -Wstringop-overflow. Link: https://github.com/KSPP/linux/issues/330 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org> diff 46e4edbf Thu Jun 23 15:59:32 MDT 2011 Jesper Juhl <jj@chaosbits.net> Remove unneeded version.h includes from fs/ It was pointed out by 'make versioncheck' that some includes of linux/version.h were not needed in fs/ (fs/btrfs/ctree.h and fs/omfs/file.c). This patch removes them. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff ffc18879 Tue Aug 03 13:19:30 MDT 2010 Bill Pemberton <wfp5p@virginia.edu> omfs: fix uninitialized variable warning quiet the warning: fs/omfs/file.c: In function 'omfs_get_block': fs/omfs/file.c:225: warning: 'new_block' may be used uninitialized in this function new_block is used properly by the call to omfs_grow_extent() Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Signed-off-by: Bob Copeland <me@bobcopeland.com> diff ffc18879 Tue Aug 03 13:19:30 MDT 2010 Bill Pemberton <wfp5p@virginia.edu> omfs: fix uninitialized variable warning quiet the warning: fs/omfs/file.c: In function 'omfs_get_block': fs/omfs/file.c:225: warning: 'new_block' may be used uninitialized in this function new_block is used properly by the call to omfs_grow_extent() Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Signed-off-by: Bob Copeland <me@bobcopeland.com> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d406f66d Tue Jul 29 23:33:46 MDT 2008 Harvey Harrison <harvey.harrison@gmail.com> omfs: sparse annotations Missing cpu_to_be64 on some constant assignments. fs/omfs/dir.c:107:16: warning: incorrect type in assignment (different base types) fs/omfs/dir.c:107:16: expected restricted __be64 [usertype] i_sibling fs/omfs/dir.c:107:16: got unsigned long long fs/omfs/file.c:33:13: warning: incorrect type in assignment (different base types) fs/omfs/file.c:33:13: expected restricted __be64 [usertype] e_next fs/omfs/file.c:33:13: got unsigned long long fs/omfs/file.c:36:24: warning: incorrect type in assignment (different base types) fs/omfs/file.c:36:24: expected restricted __be64 [usertype] e_cluster fs/omfs/file.c:36:24: got unsigned long long fs/omfs/file.c:37:23: warning: incorrect type in assignment (different base types) fs/omfs/file.c:37:23: expected restricted __be64 [usertype] e_blocks fs/omfs/file.c:37:23: got unsigned long long fs/omfs/bitmap.c:74:18: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:74:18: expected unsigned long volatile *addr fs/omfs/bitmap.c:74:18: got long *<noident> fs/omfs/bitmap.c:77:20: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:77:20: expected unsigned long volatile *addr fs/omfs/bitmap.c:77:20: got long *<noident> fs/omfs/bitmap.c:112:17: warning: incorrect type in argument 2 (different signedness) fs/omfs/bitmap.c:112:17: expected unsigned long volatile *addr fs/omfs/bitmap.c:112:17: got long *<noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/fs/sysv/ | ||
H A D | file.c | diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/minix/ | ||
H A D | file.c | diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/hpfs/ | ||
H A D | inode.c | diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff 18debbbc Wed Dec 06 21:37:05 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] hpfs: fix printk format warnings Fix hpfs printk warnings: fs/hpfs/dir.c:87: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/dir.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dir.c:148: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long int' fs/hpfs/dnode.c:537: warning: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' fs/hpfs/dnode.c:854: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'loff_t' fs/hpfs/ea.c:247: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/inode.c:254: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'long unsigned int' fs/hpfs/map.c:129: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:135: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:140: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:147: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' fs/hpfs/map.c:154: warning: format '%08x' expects type 'unsigned int', but argument 3 has type 'ino_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> |
/linux-master/fs/affs/ | ||
H A D | inode.c | diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> diff 5b825c3a Thu Feb 02 09:54:15 MST 2017 Ingo Molnar <mingo@kernel.org> sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h> Add #include <linux/cred.h> dependencies to all .c files rely on sched.h doing that for them. Note that even if the count where we need to add extra headers seems high, it's still a net win, because <linux/sched.h> is included in over 2,200 files ... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> diff 78f444f6 Tue Jun 30 15:57:55 MDT 2015 Fabian Frederick <fabf@skynet.be> fs/affs/inode.c: remove unneeded initialization bh is initialized unconditionally in affs_add_entry() Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 0cdfe18a Tue Feb 17 14:46:30 MST 2015 Fabian Frederick <fabf@skynet.be> fs/affs/inode.c: remove double extern affs_symlink_inode_operations affs_symlink_inode_operations was already declared extern in affs.h Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 3bc75993 Mon Oct 13 16:53:52 MDT 2014 Fabian Frederick <fabf@skynet.be> fs/affs/inode.c: remove unused variable head is set to AFFS_HEAD(bh) but never used. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> |
/linux-master/fs/gfs2/ | ||
H A D | xattr.c | diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 479c427d Thu Oct 01 17:00:00 MDT 2009 Steven Whitehouse <swhiteho@redhat.com> GFS2: Clean up ACLs To prepare for support for caching of ACLs, this cleans up the GFS2 ACL support by pushing the xattr code back into xattr.c and changing the acl_get function into one which only returns ACLs so that we can drop the caching function into it shortly. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> |
H A D | inode.c | diff a4122a95 Tue Apr 06 16:59:03 MDT 2021 Andreas Gruenbacher <agruenba@redhat.com> gfs2: Make gfs2_setattr_simple static This function is only used in inode.c. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> diff e3a77eeb Wed Nov 25 13:14:15 MST 2020 Andreas Gruenbacher <agruenba@redhat.com> gfs2: Make inode operations static The inode operations are not used outside inode.c. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> diff 21039132 Tue Mar 10 07:31:41 MDT 2020 Al Viro <viro@zeniv.linux.org.uk> gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache with the way fs/namei.c:do_last() had been done, ->atomic_open() instances needed to recognize the case when existing file got found with O_EXCL|O_CREAT, either by falling back to finish_no_open() or failing themselves. gfs2 one didn't. Fixes: 6d4ade986f9c (GFS2: Add atomic_open support) Cc: stable@kernel.org # v3.11 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 628e366d Mon Jun 04 06:56:51 MDT 2018 Andreas Gruenbacher <agruenba@redhat.com> gfs2: Iomap cleanups and improvements Clean up gfs2_iomap_alloc and gfs2_iomap_get. Document how gfs2_iomap_alloc works: it now needs to be called separately after gfs2_iomap_get where necessary; this will be used later by iomap write. Move gfs2_iomap_ops into bmap.c. Introduce a new gfs2_iomap_get_alloc helper and use it in fallocate_chunk: gfs2_iomap_begin will become unsuitable for fallocate with proper iomap write support. In gfs2_block_map and fallocate_chunk, zero-initialize struct iomap. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> diff 83998ccd Wed Feb 28 00:48:53 MST 2018 Andreas Gruenbacher <agruenba@redhat.com> gfs2: Dirty source inode during rename Mark the source inode dirty during a rename instead of just updating the underlying buffer head. Otherwise, fsync may find the inode clean and will then skip flushing the journal. A subsequent power failure will cause the rename to be lost. This happens in command sequences like: xfs_io -f -c 'pwrite 0 4096' -c 'fsync' foo mv foo bar xfs_io -c 'fsync' bar # power failure Fixes xfstests generic/322, generic/376. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> diff 83998ccd Wed Feb 28 00:48:53 MST 2018 Andreas Gruenbacher <agruenba@redhat.com> gfs2: Dirty source inode during rename Mark the source inode dirty during a rename instead of just updating the underlying buffer head. Otherwise, fsync may find the inode clean and will then skip flushing the journal. A subsequent power failure will cause the rename to be lost. This happens in command sequences like: xfs_io -f -c 'pwrite 0 4096' -c 'fsync' foo mv foo bar xfs_io -c 'fsync' bar # power failure Fixes xfstests generic/322, generic/376. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> diff 83998ccd Wed Feb 28 00:48:53 MST 2018 Andreas Gruenbacher <agruenba@redhat.com> gfs2: Dirty source inode during rename Mark the source inode dirty during a rename instead of just updating the underlying buffer head. Otherwise, fsync may find the inode clean and will then skip flushing the journal. A subsequent power failure will cause the rename to be lost. This happens in command sequences like: xfs_io -f -c 'pwrite 0 4096' -c 'fsync' foo mv foo bar xfs_io -c 'fsync' bar # power failure Fixes xfstests generic/322, generic/376. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> diff 2eb5909d Mon Jan 29 10:00:23 MST 2018 Bob Peterson <rpeterso@redhat.com> GFS2: Don't try to end a non-existent transaction in unlink Before this patch, if function gfs2_unlink failed to get a valid transaction (for example, not enough journal blocks) it would go to label out_end_trans which did gfs2_trans_end. But if the trans_begin failed, there's no transaction to end, and trying to do so results in: kernel BUG at fs/gfs2/trans.c:117! This patch changes the goto so that it does not try to end a non-existent transaction. Signed-off-by: Bob Peterson <rpeterso@redhat.com> diff a528d35e Tue Jan 31 09:46:22 MST 2017 David Howells <dhowells@redhat.com> statx: Add a system call to make enhanced file info available Add a system call to make extended file information available, including file creation and some attribute flags where available through the underlying filesystem. The getattr inode operation is altered to take two additional arguments: a u32 request_mask and an unsigned int flags that indicate the synchronisation mode. This change is propagated to the vfs_getattr*() function. Functions like vfs_stat() are now inline wrappers around new functions vfs_statx() and vfs_statx_fd() to reduce stack usage. ======== OVERVIEW ======== The idea was initially proposed as a set of xattrs that could be retrieved with getxattr(), but the general preference proved to be for a new syscall with an extended stat structure. A number of requests were gathered for features to be included. The following have been included: (1) Make the fields a consistent size on all arches and make them large. (2) Spare space, request flags and information flags are provided for future expansion. (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an __s64). (4) Creation time: The SMB protocol carries the creation time, which could be exported by Samba, which will in turn help CIFS make use of FS-Cache as that can be used for coherency data (stx_btime). This is also specified in NFSv4 as a recommended attribute and could be exported by NFSD [Steve French]. (5) Lightweight stat: Ask for just those details of interest, and allow a netfs (such as NFS) to approximate anything not of interest, possibly without going to the server [Trond Myklebust, Ulrich Drepper, Andreas Dilger] (AT_STATX_DONT_SYNC). (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks its cached attributes are up to date [Trond Myklebust] (AT_STATX_FORCE_SYNC). And the following have been left out for future extension: (7) Data version number: Could be used by userspace NFS servers [Aneesh Kumar]. Can also be used to modify fill_post_wcc() in NFSD which retrieves i_version directly, but has just called vfs_getattr(). It could get it from the kstat struct if it used vfs_xgetattr() instead. (There's disagreement on the exact semantics of a single field, since not all filesystems do this the same way). (8) BSD stat compatibility: Including more fields from the BSD stat such as creation time (st_btime) and inode generation number (st_gen) [Jeremy Allison, Bernd Schubert]. (9) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd Schubert]. (This was asked for but later deemed unnecessary with the open-by-handle capability available and caused disagreement as to whether it's a security hole or not). (10) Extra coherency data may be useful in making backups [Andreas Dilger]. (No particular data were offered, but things like last backup timestamp, the data version number and the DOS archive bit would come into this category). (11) Allow the filesystem to indicate what it can/cannot provide: A filesystem can now say it doesn't support a standard stat feature if that isn't available, so if, for instance, inode numbers or UIDs don't exist or are fabricated locally... (This requires a separate system call - I have an fsinfo() call idea for this). (12) Store a 16-byte volume ID in the superblock that can be returned in struct xstat [Steve French]. (Deferred to fsinfo). (13) Include granularity fields in the time data to indicate the granularity of each of the times (NFSv4 time_delta) [Steve French]. (Deferred to fsinfo). (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags. Note that the Linux IOC flags are a mess and filesystems such as Ext4 define flags that aren't in linux/fs.h, so translation in the kernel may be a necessity (or, possibly, we provide the filesystem type too). (Some attributes are made available in stx_attributes, but the general feeling was that the IOC flags were to ext[234]-specific and shouldn't be exposed through statx this way). (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer, Michael Kerrisk]. (Deferred, probably to fsinfo. Finding out if there's an ACL or seclabal might require extra filesystem operations). (16) Femtosecond-resolution timestamps [Dave Chinner]. (A __reserved field has been left in the statx_timestamp struct for this - if there proves to be a need). (17) A set multiple attributes syscall to go with this. =============== NEW SYSTEM CALL =============== The new system call is: int ret = statx(int dfd, const char *filename, unsigned int flags, unsigned int mask, struct statx *buffer); The dfd, filename and flags parameters indicate the file to query, in a similar way to fstatat(). There is no equivalent of lstat() as that can be emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is also no equivalent of fstat() as that can be emulated by passing a NULL filename to statx() with the fd of interest in dfd. Whether or not statx() synchronises the attributes with the backing store can be controlled by OR'ing a value into the flags argument (this typically only affects network filesystems): (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this respect. (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise its attributes with the server - which might require data writeback to occur to get the timestamps correct. (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a network filesystem. The resulting values should be considered approximate. mask is a bitmask indicating the fields in struct statx that are of interest to the caller. The user should set this to STATX_BASIC_STATS to get the basic set returned by stat(). It should be noted that asking for more information may entail extra I/O operations. buffer points to the destination for the data. This must be 256 bytes in size. ====================== MAIN ATTRIBUTES RECORD ====================== The following structures are defined in which to return the main attribute set: struct statx_timestamp { __s64 tv_sec; __s32 tv_nsec; __s32 __reserved; }; struct statx { __u32 stx_mask; __u32 stx_blksize; __u64 stx_attributes; __u32 stx_nlink; __u32 stx_uid; __u32 stx_gid; __u16 stx_mode; __u16 __spare0[1]; __u64 stx_ino; __u64 stx_size; __u64 stx_blocks; __u64 __spare1[1]; struct statx_timestamp stx_atime; struct statx_timestamp stx_btime; struct statx_timestamp stx_ctime; struct statx_timestamp stx_mtime; __u32 stx_rdev_major; __u32 stx_rdev_minor; __u32 stx_dev_major; __u32 stx_dev_minor; __u64 __spare2[14]; }; The defined bits in request_mask and stx_mask are: STATX_TYPE Want/got stx_mode & S_IFMT STATX_MODE Want/got stx_mode & ~S_IFMT STATX_NLINK Want/got stx_nlink STATX_UID Want/got stx_uid STATX_GID Want/got stx_gid STATX_ATIME Want/got stx_atime{,_ns} STATX_MTIME Want/got stx_mtime{,_ns} STATX_CTIME Want/got stx_ctime{,_ns} STATX_INO Want/got stx_ino STATX_SIZE Want/got stx_size STATX_BLOCKS Want/got stx_blocks STATX_BASIC_STATS [The stuff in the normal stat struct] STATX_BTIME Want/got stx_btime{,_ns} STATX_ALL [All currently available stuff] stx_btime is the file creation time, stx_mask is a bitmask indicating the data provided and __spares*[] are where as-yet undefined fields can be placed. Time fields are structures with separate seconds and nanoseconds fields plus a reserved field in case we want to add even finer resolution. Note that times will be negative if before 1970; in such a case, the nanosecond fields will also be negative if not zero. The bits defined in the stx_attributes field convey information about a file, how it is accessed, where it is and what it does. The following attributes map to FS_*_FL flags and are the same numerical value: STATX_ATTR_COMPRESSED File is compressed by the fs STATX_ATTR_IMMUTABLE File is marked immutable STATX_ATTR_APPEND File is append-only STATX_ATTR_NODUMP File is not to be dumped STATX_ATTR_ENCRYPTED File requires key to decrypt in fs Within the kernel, the supported flags are listed by: KSTAT_ATTR_FS_IOC_FLAGS [Are any other IOC flags of sufficient general interest to be exposed through this interface?] New flags include: STATX_ATTR_AUTOMOUNT Object is an automount trigger These are for the use of GUI tools that might want to mark files specially, depending on what they are. Fields in struct statx come in a number of classes: (0) stx_dev_*, stx_blksize. These are local system information and are always available. (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino, stx_size, stx_blocks. These will be returned whether the caller asks for them or not. The corresponding bits in stx_mask will be set to indicate whether they actually have valid values. If the caller didn't ask for them, then they may be approximated. For example, NFS won't waste any time updating them from the server, unless as a byproduct of updating something requested. If the values don't actually exist for the underlying object (such as UID or GID on a DOS file), then the bit won't be set in the stx_mask, even if the caller asked for the value. In such a case, the returned value will be a fabrication. Note that there are instances where the type might not be valid, for instance Windows reparse points. (2) stx_rdev_*. This will be set only if stx_mode indicates we're looking at a blockdev or a chardev, otherwise will be 0. (3) stx_btime. Similar to (1), except this will be set to 0 if it doesn't exist. ======= TESTING ======= The following test program can be used to test the statx system call: samples/statx/test-statx.c Just compile and run, passing it paths to the files you want to examine. The file is built automatically if CONFIG_SAMPLES is enabled. Here's some example output. Firstly, an NFS directory that crosses to another FSID. Note that the AUTOMOUNT attribute is set because transiting this directory will cause d_automount to be invoked by the VFS. [root@andromeda ~]# /tmp/test-statx -A /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:26 Inode: 1703937 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------) Secondly, the result of automounting on that directory. [root@andromeda ~]# /tmp/test-statx /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:27 Inode: 2 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 5b825c3a Thu Feb 02 09:54:15 MST 2017 Ingo Molnar <mingo@kernel.org> sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h> Add #include <linux/cred.h> dependencies to all .c files rely on sched.h doing that for them. Note that even if the count where we need to add extra headers seems high, it's still a net win, because <linux/sched.h> is included in over 2,200 files ... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
/linux-master/fs/jfs/ | ||
H A D | file.c | diff 71e7b535 Tue Jun 21 08:14:52 MDT 2022 Christian Brauner <brauner@kernel.org> quota: port quota helpers mount ids Port the is_quota_modification() and dqout_transfer() helper to type safe vfs{g,u}id_t. Since these helpers are only called by a few filesystems don't introduce a new helper but simply extend the existing helpers to pass down the mount's idmapping. Note, that this is a non-functional change, i.e. nothing will have happened here or at the end of this series to how quota are done! This a change necessary because we will at the end of this series make ownership changes easier to reason about by keeping the original value in struct iattr for both non-idmapped and idmapped mounts. For now we always pass the initial idmapping which makes the idmapping functions these helpers call nops. This is done because we currently always pass the actual value to be written to i_{g,u}id via struct iattr. While this allowed us to treat the {g,u}id values in struct iattr as values that can be directly written to inode->i_{g,u}id it also increases the potential for confusion for filesystems. Now that we are have dedicated types to prevent this confusion we will ultimately only map the value from the idmapped mount into a filesystem value that can be written to inode->i_{g,u}id when the filesystem actually updates the inode. So pass down the initial idmapping until we finished that conversion at which point we pass down the mount's idmapping. Since struct iattr uses an anonymous union with overlapping types as supported by the C standard, filesystems that haven't converted to ia_vfs{g,u}id won't see any difference and things will continue to work as before. In other words, no functional changes intended with this change. Link: https://lore.kernel.org/r/20220621141454.2914719-7-brauner@kernel.org Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> CC: linux-fsdevel@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Seth Forshee <sforshee@digitalocean.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> diff 5955102c Fri Jan 22 13:40:57 MST 2016 Al Viro <viro@zeniv.linux.org.uk> wrappers for ->i_mutex access parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 0ae45f63 Sun Feb 01 22:37:00 MST 2015 Theodore Ts'o <tytso@mit.edu> vfs: add support for a lazytime mount option Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call. For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact long tail latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 4e34e719 Sat Jul 23 09:37:31 MDT 2011 Christoph Hellwig <hch@lst.de> fs: take the ACL checks to common code Replace the ->check_acl method with a ->get_acl method that simply reads an ACL from disk after having a cache miss. This means we can replace the ACL checking boilerplate code with a single implementation in namei.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/udf/ | ||
H A D | file.c | diff 174cb748 Mon Jan 23 16:29:57 MST 2023 Jan Kara <jack@suse.cz> udf: Move udf_adinicb_readpage() to inode.c udf_adinicb_readpage() is only called from aops functions, move it to the same file as its callers and also drop the stale comment - invalidate_lock is protecting us against races with truncate. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> diff 5955102c Fri Jan 22 13:40:57 MST 2016 Al Viro <viro@zeniv.linux.org.uk> wrappers for ->i_mutex access parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 3309dd04 Wed Apr 08 22:55:47 MDT 2015 Al Viro <viro@zeniv.linux.org.uk> switch generic_write_checks() to iocb and iter ... returning -E... upon error and amount of data left in iter after (possible) truncation upon success. Note, that normal case gives a non-zero (positive) return value, so any tests for != 0 _must_ be updated. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Conflicts: fs/ext4/file.c diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff de67445f Sat May 22 11:13:09 MDT 2010 Huang Weiyi <weiyi.huang@gmail.com> udf: remove duplicated #include Remove duplicated #include('s) in fs/udf/file.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> diff e8edc6e0 Sun May 20 15:22:52 MDT 2007 Alexey Dobriyan <adobriyan@gmail.com> Detach sched.h from mm.h First thing mm.h does is including sched.h solely for can_do_mlock() inline function which has "current" dereference inside. By dealing with can_do_mlock() mm.h can be detached from sched.h which is good. See below, why. This patch a) removes unconditional inclusion of sched.h from mm.h b) makes can_do_mlock() normal function in mm/mlock.c c) exports can_do_mlock() to not break compilation d) adds sched.h inclusions back to files that were getting it indirectly. e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were getting them indirectly Net result is: a) mm.h users would get less code to open, read, preprocess, parse, ... if they don't need sched.h b) sched.h stops being dependency for significant number of files: on x86_64 allmodconfig touching sched.h results in recompile of 4083 files, after patch it's only 3744 (-8.3%). Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e8edc6e0 Sun May 20 15:22:52 MDT 2007 Alexey Dobriyan <adobriyan@gmail.com> Detach sched.h from mm.h First thing mm.h does is including sched.h solely for can_do_mlock() inline function which has "current" dereference inside. By dealing with can_do_mlock() mm.h can be detached from sched.h which is good. See below, why. This patch a) removes unconditional inclusion of sched.h from mm.h b) makes can_do_mlock() normal function in mm/mlock.c c) exports can_do_mlock() to not break compilation d) adds sched.h inclusions back to files that were getting it indirectly. e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were getting them indirectly Net result is: a) mm.h users would get less code to open, read, preprocess, parse, ... if they don't need sched.h b) sched.h stops being dependency for significant number of files: on x86_64 allmodconfig touching sched.h results in recompile of 4083 files, after patch it's only 3744 (-8.3%). Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/fs/ocfs2/dlmfs/ | ||
H A D | dlmfs.c | diff ca322fb6 Thu Jan 30 23:11:40 MST 2020 Masahiro Yamada <masahiroy@kernel.org> ocfs2: make local header paths relative to C files Gang He reports the failure of building fs/ocfs2/ as an external module of the kernel installed on the system: $ cd fs/ocfs2 $ make -C /lib/modules/`uname -r`/build M=`pwd` modules If you want to make it work reliably, I'd recommend to remove ccflags-y from the Makefiles, and to make header paths relative to the C files. I think this is the correct usage of the #include "..." directive. Link: http://lkml.kernel.org/r/20191227022950.14804-1-ghe@suse.com Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Gang He <ghe@suse.com> Reported-by: Gang He <ghe@suse.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ca322fb6 Thu Jan 30 23:11:40 MST 2020 Masahiro Yamada <masahiroy@kernel.org> ocfs2: make local header paths relative to C files Gang He reports the failure of building fs/ocfs2/ as an external module of the kernel installed on the system: $ cd fs/ocfs2 $ make -C /lib/modules/`uname -r`/build M=`pwd` modules If you want to make it work reliably, I'd recommend to remove ccflags-y from the Makefiles, and to make header paths relative to the C files. I think this is the correct usage of the #include "..." directive. Link: http://lkml.kernel.org/r/20191227022950.14804-1-ghe@suse.com Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Gang He <ghe@suse.com> Reported-by: Gang He <ghe@suse.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ca322fb6 Thu Jan 30 23:11:40 MST 2020 Masahiro Yamada <masahiroy@kernel.org> ocfs2: make local header paths relative to C files Gang He reports the failure of building fs/ocfs2/ as an external module of the kernel installed on the system: $ cd fs/ocfs2 $ make -C /lib/modules/`uname -r`/build M=`pwd` modules If you want to make it work reliably, I'd recommend to remove ccflags-y from the Makefiles, and to make header paths relative to the C files. I think this is the correct usage of the #include "..." directive. Link: http://lkml.kernel.org/r/20191227022950.14804-1-ghe@suse.com Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Gang He <ghe@suse.com> Reported-by: Gang He <ghe@suse.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e4dca7b7 Tue Oct 17 20:04:42 MDT 2017 Kees Cook <keescook@chromium.org> treewide: Fix function prototypes for module_param_call() Several function prototypes for the set/get functions defined by module_param_call() have a slightly wrong argument types. This fixes those in an effort to clean up the calls when running under type-enforced compiler instrumentation for CFI. This is the result of running the following semantic patch: @match_module_param_call_function@ declarer name module_param_call; identifier _name, _set_func, _get_func; expression _arg, _mode; @@ module_param_call(_name, _set_func, _get_func, _arg, _mode); @fix_set_prototype depends on match_module_param_call_function@ identifier match_module_param_call_function._set_func; identifier _val, _param; type _val_type, _param_type; @@ int _set_func( -_val_type _val +const char * _val , -_param_type _param +const struct kernel_param * _param ) { ... } @fix_get_prototype depends on match_module_param_call_function@ identifier match_module_param_call_function._get_func; identifier _val, _param; type _val_type, _param_type; @@ int _get_func( -_val_type _val +char * _val , -_param_type _param +const struct kernel_param * _param ) { ... } Two additional by-hand changes are included for places where the above Coccinelle script didn't notice them: drivers/platform/x86/thinkpad_acpi.c fs/lockd/svc.c Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jessica Yu <jeyu@kernel.org> diff e4dca7b7 Tue Oct 17 20:04:42 MDT 2017 Kees Cook <keescook@chromium.org> treewide: Fix function prototypes for module_param_call() Several function prototypes for the set/get functions defined by module_param_call() have a slightly wrong argument types. This fixes those in an effort to clean up the calls when running under type-enforced compiler instrumentation for CFI. This is the result of running the following semantic patch: @match_module_param_call_function@ declarer name module_param_call; identifier _name, _set_func, _get_func; expression _arg, _mode; @@ module_param_call(_name, _set_func, _get_func, _arg, _mode); @fix_set_prototype depends on match_module_param_call_function@ identifier match_module_param_call_function._set_func; identifier _val, _param; type _val_type, _param_type; @@ int _set_func( -_val_type _val +const char * _val , -_param_type _param +const struct kernel_param * _param ) { ... } @fix_get_prototype depends on match_module_param_call_function@ identifier match_module_param_call_function._get_func; identifier _val, _param; type _val_type, _param_type; @@ int _get_func( -_val_type _val +char * _val , -_param_type _param +const struct kernel_param * _param ) { ... } Two additional by-hand changes are included for places where the above Coccinelle script didn't notice them: drivers/platform/x86/thinkpad_acpi.c fs/lockd/svc.c Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jessica Yu <jeyu@kernel.org> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/hfs/ | ||
H A D | inode.c | diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 8d824e69 Tue Nov 29 23:59:59 MST 2022 ZhangPeng <zhangpeng362@huawei.com> hfs: fix OOB Read in __hfs_brec_find Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
/linux-master/fs/ | ||
H A D | attr.c | diff 92383111 Thu Feb 15 03:31:10 MST 2024 Roberto Sassu <roberto.sassu@huawei.com> evm: Move to LSM infrastructure As for IMA, move hardcoded EVM function calls from various places in the kernel to the LSM infrastructure, by introducing a new LSM named 'evm' (last and always enabled like 'ima'). The order in the Makefile ensures that 'evm' hooks are executed after 'ima' ones. Make EVM functions as static (except for evm_inode_init_security(), which is exported), and register them as hook implementations in init_evm_lsm(). Also move the inline functions evm_inode_remove_acl(), evm_inode_post_remove_acl(), and evm_inode_post_set_acl() from the public evm.h header to evm_main.c. Unlike before (see commit to move IMA to the LSM infrastructure), evm_inode_post_setattr(), evm_inode_post_set_acl(), evm_inode_post_remove_acl(), and evm_inode_post_removexattr() are not executed for private inodes. Finally, add the LSM_ID_EVM case in lsm_list_modules_test.c Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Acked-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Paul Moore <paul@paul-moore.com> diff 92383111 Thu Feb 15 03:31:10 MST 2024 Roberto Sassu <roberto.sassu@huawei.com> evm: Move to LSM infrastructure As for IMA, move hardcoded EVM function calls from various places in the kernel to the LSM infrastructure, by introducing a new LSM named 'evm' (last and always enabled like 'ima'). The order in the Makefile ensures that 'evm' hooks are executed after 'ima' ones. Make EVM functions as static (except for evm_inode_init_security(), which is exported), and register them as hook implementations in init_evm_lsm(). Also move the inline functions evm_inode_remove_acl(), evm_inode_post_remove_acl(), and evm_inode_post_set_acl() from the public evm.h header to evm_main.c. Unlike before (see commit to move IMA to the LSM infrastructure), evm_inode_post_setattr(), evm_inode_post_set_acl(), evm_inode_post_remove_acl(), and evm_inode_post_removexattr() are not executed for private inodes. Finally, add the LSM_ID_EVM case in lsm_list_modules_test.c Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Acked-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Paul Moore <paul@paul-moore.com> diff 84594c9e Thu Feb 15 03:31:09 MST 2024 Roberto Sassu <roberto.sassu@huawei.com> ima: Move IMA-Appraisal to LSM infrastructure A few additional IMA hooks are needed to reset the cached appraisal status, causing the file's integrity to be re-evaluated on next access. Register these IMA-appraisal only functions separately from the rest of IMA functions, as appraisal is a separate feature not necessarily enabled in the kernel configuration. Reuse the same approach as for other IMA functions, move hardcoded calls from various places in the kernel to the LSM infrastructure. Declare the functions as static and register them as hook implementations in init_ima_appraise_lsm(), called by init_ima_lsm(). Also move the inline function ima_inode_remove_acl() from the public ima.h header to ima_appraise.c. Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Reviewed-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Paul Moore <paul@paul-moore.com> diff fe12cfc1 Tue Jan 09 00:29:27 MST 2024 Jay <merqqcury@gmail.com> fs: fix a typo in attr.c The word "filesytem" should be "filesystem" Signed-off-by: Jay <merqqcury@gmail.com> Link: https://lore.kernel.org/r/20240109072927.29626-1-merqqcury@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> diff 5d1f903f Wed Jul 12 12:58:49 MDT 2023 Christian Brauner <brauner@kernel.org> attr: block mode changes of symlinks Changing the mode of symlinks is meaningless as the vfs doesn't take the mode of a symlink into account during path lookup permission checking. However, the vfs doesn't block mode changes on symlinks. This however, has lead to an untenable mess roughly classifiable into the following two categories: (1) Filesystems that don't implement a i_op->setattr() for symlinks. Such filesystems may or may not know that without i_op->setattr() defined, notify_change() falls back to simple_setattr() causing the inode's mode in the inode cache to be changed. That's a generic issue as this will affect all non-size changing inode attributes including ownership changes. Example: afs (2) Filesystems that fail with EOPNOTSUPP but change the mode of the symlink nonetheless. Some filesystems will happily update the mode of a symlink but still return EOPNOTSUPP. This is the biggest source of confusion for userspace. The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it comes from filesystems that call posix_acl_chmod(), e.g., btrfs via if (!err && attr->ia_valid & ATTR_MODE) err = posix_acl_chmod(idmap, dentry, inode->i_mode); Filesystems including btrfs don't implement i_op->set_acl() so posix_acl_chmod() will report EOPNOTSUPP. When posix_acl_chmod() is called, most filesystems will have finished updating the inode. Perversely, this has the consequences that this behavior may depend on two kconfig options and mount options: * CONFIG_POSIX_ACL={y,n} * CONFIG_${FSTYPE}_POSIX_ACL={y,n} * Opt_acl, Opt_noacl Example: btrfs, ext4, xfs The only way to change the mode on a symlink currently involves abusing an O_PATH file descriptor in the following manner: fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW); char path[PATH_MAX]; snprintf(path, sizeof(path), "/proc/self/fd/%d", fd); chmod(path, 0000); But for most major filesystems with POSIX ACL support such as btrfs, ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with the mode still updated due to the aforementioned posix_acl_chmod() nonsense. So, given that for all major filesystems this would fail with EOPNOTSUPP and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode changes on symlinks we should just try and block mode changes on symlinks directly in the vfs and have a clean break with this nonsense. If this causes any regressions, we do the next best thing and fix up all filesystems that do return EOPNOTSUPP with the mode updated to not call posix_acl_chmod() on symlinks. But as usual, let's try the clean cut solution first. It's a simple patch that can be easily reverted. Not marking this for backport as I'll do that manually if we're reasonably sure that this works and there are no strong objections. We could block this in chmod_common() but it's more appropriate to do it notify_change() as it will also mean that we catch filesystems that change symlink permissions explicitly or accidently. Similar proposals were floated in the past as in [3] and [4] and again recently in [5]. There's also a couple of bugs about this inconsistency as in [6] and [7]. Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1] Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2] Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3] Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4] Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5] Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6] Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7] Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged Suggested-by: Christoph Hellwig <hch@lst.de> Suggested-by: Florian Weimer <fweimer@redhat.com> Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> diff 5d1f903f Wed Jul 12 12:58:49 MDT 2023 Christian Brauner <brauner@kernel.org> attr: block mode changes of symlinks Changing the mode of symlinks is meaningless as the vfs doesn't take the mode of a symlink into account during path lookup permission checking. However, the vfs doesn't block mode changes on symlinks. This however, has lead to an untenable mess roughly classifiable into the following two categories: (1) Filesystems that don't implement a i_op->setattr() for symlinks. Such filesystems may or may not know that without i_op->setattr() defined, notify_change() falls back to simple_setattr() causing the inode's mode in the inode cache to be changed. That's a generic issue as this will affect all non-size changing inode attributes including ownership changes. Example: afs (2) Filesystems that fail with EOPNOTSUPP but change the mode of the symlink nonetheless. Some filesystems will happily update the mode of a symlink but still return EOPNOTSUPP. This is the biggest source of confusion for userspace. The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it comes from filesystems that call posix_acl_chmod(), e.g., btrfs via if (!err && attr->ia_valid & ATTR_MODE) err = posix_acl_chmod(idmap, dentry, inode->i_mode); Filesystems including btrfs don't implement i_op->set_acl() so posix_acl_chmod() will report EOPNOTSUPP. When posix_acl_chmod() is called, most filesystems will have finished updating the inode. Perversely, this has the consequences that this behavior may depend on two kconfig options and mount options: * CONFIG_POSIX_ACL={y,n} * CONFIG_${FSTYPE}_POSIX_ACL={y,n} * Opt_acl, Opt_noacl Example: btrfs, ext4, xfs The only way to change the mode on a symlink currently involves abusing an O_PATH file descriptor in the following manner: fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW); char path[PATH_MAX]; snprintf(path, sizeof(path), "/proc/self/fd/%d", fd); chmod(path, 0000); But for most major filesystems with POSIX ACL support such as btrfs, ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with the mode still updated due to the aforementioned posix_acl_chmod() nonsense. So, given that for all major filesystems this would fail with EOPNOTSUPP and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode changes on symlinks we should just try and block mode changes on symlinks directly in the vfs and have a clean break with this nonsense. If this causes any regressions, we do the next best thing and fix up all filesystems that do return EOPNOTSUPP with the mode updated to not call posix_acl_chmod() on symlinks. But as usual, let's try the clean cut solution first. It's a simple patch that can be easily reverted. Not marking this for backport as I'll do that manually if we're reasonably sure that this works and there are no strong objections. We could block this in chmod_common() but it's more appropriate to do it notify_change() as it will also mean that we catch filesystems that change symlink permissions explicitly or accidently. Similar proposals were floated in the past as in [3] and [4] and again recently in [5]. There's also a couple of bugs about this inconsistency as in [6] and [7]. Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1] Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2] Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3] Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4] Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5] Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6] Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7] Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged Suggested-by: Christoph Hellwig <hch@lst.de> Suggested-by: Florian Weimer <fweimer@redhat.com> Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> diff ed5a7047 Mon Oct 17 09:06:37 MDT 2022 Christian Brauner <brauner@kernel.org> attr: use consistent sgid stripping checks Currently setgid stripping in file_remove_privs()'s should_remove_suid() helper is inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() // TAKE ON MOUNTER'S CREDS -> ovl_do_notify_change() -> notify_change() // GIVE UP MOUNTER'S CREDS // TAKE ON MOUNTER'S CREDS -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper to perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID. While we're at it we move should_remove_suid() from inode.c to attr.c where it belongs with the rest of the iattr helpers. Especially since it returns ATTR_KILL_S{G,U}ID flags. We also rename it to setattr_should_drop_suidgid() to better reflect that it indicates both setuid and setgid bit removal and also that it returns attr flags. Running xfstests with this doesn't report any regressions. We should really try and use consistent checks. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> diff ed5a7047 Mon Oct 17 09:06:37 MDT 2022 Christian Brauner <brauner@kernel.org> attr: use consistent sgid stripping checks Currently setgid stripping in file_remove_privs()'s should_remove_suid() helper is inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() // TAKE ON MOUNTER'S CREDS -> ovl_do_notify_change() -> notify_change() // GIVE UP MOUNTER'S CREDS // TAKE ON MOUNTER'S CREDS -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper to perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID. While we're at it we move should_remove_suid() from inode.c to attr.c where it belongs with the rest of the iattr helpers. Especially since it returns ATTR_KILL_S{G,U}ID flags. We also rename it to setattr_should_drop_suidgid() to better reflect that it indicates both setuid and setgid bit removal and also that it returns attr flags. Running xfstests with this doesn't report any regressions. We should really try and use consistent checks. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> diff e243e3f9 Mon Oct 17 09:06:35 MDT 2022 Christian Brauner <brauner@kernel.org> fs: move should_remove_suid() Move the helper from inode.c to attr.c. This keeps the the core of the set{g,u}id stripping logic in one place when we add follow-up changes. It is the better place anyway, since should_remove_suid() returns ATTR_KILL_S{G,U}ID flags. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> diff e243e3f9 Mon Oct 17 09:06:35 MDT 2022 Christian Brauner <brauner@kernel.org> fs: move should_remove_suid() Move the helper from inode.c to attr.c. This keeps the the core of the set{g,u}id stripping logic in one place when we add follow-up changes. It is the better place anyway, since should_remove_suid() returns ATTR_KILL_S{G,U}ID flags. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
/linux-master/fs/hostfs/ | ||
H A D | hostfs_kern.c | diff e0820368 Thu Oct 13 10:17:00 MDT 2022 Fabio M. De Francesco <fmdefrancesco@gmail.com> hostfs: Replace kmap() with kmap_local_page() The use of kmap() is being deprecated in favor of kmap_local_page(). There are two main problems with kmap(): (1) It comes with an overhead as the mapping space is restricted and protected by a global lock for synchronization and (2) it also requires global TLB invalidation when the kmap’s pool wraps and it might block when the mapping space is fully utilized until a slot becomes available. With kmap_local_page() the mappings are per thread, CPU local, can take page faults, and can be called from any context (including interrupts). It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore, the tasks can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and still valid. Therefore, replace kmap() with kmap_local_page() in hostfs_kern.c, it being the only file with kmap() call sites currently left in fs/hostfs. Cc: "Venkataramanan, Anirudh" <anirudh.venkataramanan@intel.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at> diff bca30265 Tue Jun 12 07:31:17 MDT 2018 Arnd Bergmann <arnd@arndb.de> hostfs: pass 64-bit timestamps to/from user space The use of 'struct timespec' is deprecated in the kernel, so we want to avoid the conversions from/to the proper timespec64 structure. On the user space side, we have a 'struct timespec' that is defined by the C library and that will be incompatible with the kernel's view on 32-bit architectures once they move to a 64-bit time_t, breaking the shared binary layout of hostfs_iattr and hostfs_stat. This changes the two structures to use a new hostfs_timespec structure with fixed 64-bit seconds/nanoseconds for passing the timestamps between hostfs_kern.c and hostfs_user.c. With a new enough user space side, this will allow timestamps beyond year 2038. Signed-off-by: Arnd Bergmann <arnd@arndb.de> diff bca30265 Tue Jun 12 07:31:17 MDT 2018 Arnd Bergmann <arnd@arndb.de> hostfs: pass 64-bit timestamps to/from user space The use of 'struct timespec' is deprecated in the kernel, so we want to avoid the conversions from/to the proper timespec64 structure. On the user space side, we have a 'struct timespec' that is defined by the C library and that will be incompatible with the kernel's view on 32-bit architectures once they move to a 64-bit time_t, breaking the shared binary layout of hostfs_iattr and hostfs_stat. This changes the two structures to use a new hostfs_timespec structure with fixed 64-bit seconds/nanoseconds for passing the timestamps between hostfs_kern.c and hostfs_user.c. With a new enough user space side, this will allow timestamps beyond year 2038. Signed-off-by: Arnd Bergmann <arnd@arndb.de> diff bca30265 Tue Jun 12 07:31:17 MDT 2018 Arnd Bergmann <arnd@arndb.de> hostfs: pass 64-bit timestamps to/from user space The use of 'struct timespec' is deprecated in the kernel, so we want to avoid the conversions from/to the proper timespec64 structure. On the user space side, we have a 'struct timespec' that is defined by the C library and that will be incompatible with the kernel's view on 32-bit architectures once they move to a 64-bit time_t, breaking the shared binary layout of hostfs_iattr and hostfs_stat. This changes the two structures to use a new hostfs_timespec structure with fixed 64-bit seconds/nanoseconds for passing the timestamps between hostfs_kern.c and hostfs_user.c. With a new enough user space side, this will allow timestamps beyond year 2038. Signed-off-by: Arnd Bergmann <arnd@arndb.de> diff 6380161c Sun Mar 11 04:34:48 MDT 2018 Dominik Brodowski <linux@dominikbrodowski.net> hostfs: rename do_rmdir() to hostfs_do_rmdir() do_rmdir() is used in the VFS layer at fs/namei.c, so use a different name in hostfs. Cc: Jeff Dike <jdike@addtoit.com> Cc: user-mode-linux-devel@lists.sourceforge.net Acked-by: Richard Weinberger <richard@nod.at> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> diff 6380161c Sun Mar 11 04:34:48 MDT 2018 Dominik Brodowski <linux@dominikbrodowski.net> hostfs: rename do_rmdir() to hostfs_do_rmdir() do_rmdir() is used in the VFS layer at fs/namei.c, so use a different name in hostfs. Cc: Jeff Dike <jdike@addtoit.com> Cc: user-mode-linux-devel@lists.sourceforge.net Acked-by: Richard Weinberger <richard@nod.at> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> diff 5955102c Fri Jan 22 13:40:57 MST 2016 Al Viro <viro@zeniv.linux.org.uk> wrappers for ->i_mutex access parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 9e443bc3 Thu Nov 14 14:15:13 MST 2013 James Hogan <jhogan@kernel.org> um: hostfs: make functions static The hostfs_*() callback functions are all only used within hostfs_kern.c, so make them static. Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: user-mode-linux-devel@lists.sourceforge.net Signed-off-by: Richard Weinberger <richard@nod.at> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/hfsplus/ | ||
H A D | inode.c | diff 9f2b5deb Tue Dec 06 20:05:40 MST 2022 Aditya Garg <gargaditya08@live.com> hfsplus: fix bug causing custom uid and gid being unable to be assigned with mount Despite specifying UID and GID in mount command, the specified UID and GID were not being assigned. This patch fixes this issue. Link: https://lkml.kernel.org/r/C0264BF5-059C-45CF-B8DA-3A3BD2C803A2@live.com Signed-off-by: Aditya Garg <gargaditya08@live.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 55d1cbbb Mon Nov 08 19:35:04 MST 2021 Arnd Bergmann <arnd@arndb.de> hfs/hfsplus: use WARN_ON for sanity check gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 55d1cbbb Mon Nov 08 19:35:04 MST 2021 Arnd Bergmann <arnd@arndb.de> hfs/hfsplus: use WARN_ON for sanity check gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 55d1cbbb Mon Nov 08 19:35:04 MST 2021 Arnd Bergmann <arnd@arndb.de> hfs/hfsplus: use WARN_ON for sanity check gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 55d1cbbb Mon Nov 08 19:35:04 MST 2021 Arnd Bergmann <arnd@arndb.de> hfs/hfsplus: use WARN_ON for sanity check gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 55d1cbbb Mon Nov 08 19:35:04 MST 2021 Arnd Bergmann <arnd@arndb.de> hfs/hfsplus: use WARN_ON for sanity check gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 55d1cbbb Mon Nov 08 19:35:04 MST 2021 Arnd Bergmann <arnd@arndb.de> hfs/hfsplus: use WARN_ON for sanity check gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 55d1cbbb Mon Nov 08 19:35:04 MST 2021 Arnd Bergmann <arnd@arndb.de> hfs/hfsplus: use WARN_ON for sanity check gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 55d1cbbb Mon Nov 08 19:35:04 MST 2021 Arnd Bergmann <arnd@arndb.de> hfs/hfsplus: use WARN_ON for sanity check gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 55d1cbbb Mon Nov 08 19:35:04 MST 2021 Arnd Bergmann <arnd@arndb.de> hfs/hfsplus: use WARN_ON for sanity check gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/arch/powerpc/platforms/cell/spufs/ | ||
H A D | inode.c | diff 35a5c328 Wed Oct 23 07:44:23 MDT 2019 YueHaibing <yuehaibing@huawei.com> powerpc/spufs: remove set but not used variable 'ctx' arch/powerpc/platforms/cell/spufs/inode.c:201:22: warning: variable ctx set but not used [-Wunused-but-set-variable] It is not used since commit 67cba9fd6456 ("move spu_forget() into spufs_rmdir()") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191023134423.15052-1-yuehaibing@huawei.com diff 5955102c Fri Jan 22 13:40:57 MST 2016 Al Viro <viro@zeniv.linux.org.uk> wrappers for ->i_mutex access parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1025774c Fri Jun 04 03:30:02 MDT 2010 Christoph Hellwig <hch@lst.de> remove inode_setattr Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 1babddbc Tue Jun 02 12:00:51 MDT 2009 Stephen Rothwell <sfr@canb.auug.org.au> powerpc/spufs: Remove unused error path Commit 45db9240890d9343c934db1fe82d6e68ac2d28da ("powerpc/spufs: Remove double check for non-negative dentry") removed the only user of the out_dput label, so remove it and the code following it. Gets rid of this warning: arch/powerpc/platforms/cell/spufs/inode.c: In function 'spufs_create': arch/powerpc/platforms/cell/spufs/inode.c:647: warning: label 'out_dput' defined but not used Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> diff 1babddbc Tue Jun 02 12:00:51 MDT 2009 Stephen Rothwell <sfr@canb.auug.org.au> powerpc/spufs: Remove unused error path Commit 45db9240890d9343c934db1fe82d6e68ac2d28da ("powerpc/spufs: Remove double check for non-negative dentry") removed the only user of the out_dput label, so remove it and the code following it. Gets rid of this warning: arch/powerpc/platforms/cell/spufs/inode.c: In function 'spufs_create': arch/powerpc/platforms/cell/spufs/inode.c:647: warning: label 'out_dput' defined but not used Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> diff 51cc5068 Fri Jul 25 20:45:34 MDT 2008 Alexey Dobriyan <adobriyan@gmail.com> SL*B: drop kmem cache argument from constructor Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 51cc5068 Fri Jul 25 20:45:34 MDT 2008 Alexey Dobriyan <adobriyan@gmail.com> SL*B: drop kmem cache argument from constructor Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 51cc5068 Fri Jul 25 20:45:34 MDT 2008 Alexey Dobriyan <adobriyan@gmail.com> SL*B: drop kmem cache argument from constructor Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 51cc5068 Fri Jul 25 20:45:34 MDT 2008 Alexey Dobriyan <adobriyan@gmail.com> SL*B: drop kmem cache argument from constructor Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/fs/nilfs2/ | ||
H A D | inode.c | diff 516edb45 Thu May 12 08:05:29 MDT 2022 Yang Li <yang.lee@linux.alibaba.com> nilfs2: Fix some kernel-doc comments The description of @flags in nilfs_dirty_inode() kernel-doc comment is missing, and some functions had kernel-doc that used a hash instead of a colon to separate the parameter name from the one line description. Fix them to remove some warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/nilfs2/inode.c:73: warning: Function parameter or member 'inode' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'blkoff' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'bh_result' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'create' not described in 'nilfs_get_block' fs/nilfs2/inode.c:145: warning: Function parameter or member 'file' not described in 'nilfs_readpage' fs/nilfs2/inode.c:145: warning: Function parameter or member 'page' not described in 'nilfs_readpage' fs/nilfs2/inode.c:968: warning: Function parameter or member 'flags' not described in 'nilfs_dirty_inode' Link: https://lkml.kernel.org/r/20220324024215.63479-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/1652276316-7791-1-git-send-email-konishi.ryusuke@gmail.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff 516edb45 Thu May 12 08:05:29 MDT 2022 Yang Li <yang.lee@linux.alibaba.com> nilfs2: Fix some kernel-doc comments The description of @flags in nilfs_dirty_inode() kernel-doc comment is missing, and some functions had kernel-doc that used a hash instead of a colon to separate the parameter name from the one line description. Fix them to remove some warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/nilfs2/inode.c:73: warning: Function parameter or member 'inode' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'blkoff' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'bh_result' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'create' not described in 'nilfs_get_block' fs/nilfs2/inode.c:145: warning: Function parameter or member 'file' not described in 'nilfs_readpage' fs/nilfs2/inode.c:145: warning: Function parameter or member 'page' not described in 'nilfs_readpage' fs/nilfs2/inode.c:968: warning: Function parameter or member 'flags' not described in 'nilfs_dirty_inode' Link: https://lkml.kernel.org/r/20220324024215.63479-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/1652276316-7791-1-git-send-email-konishi.ryusuke@gmail.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff 516edb45 Thu May 12 08:05:29 MDT 2022 Yang Li <yang.lee@linux.alibaba.com> nilfs2: Fix some kernel-doc comments The description of @flags in nilfs_dirty_inode() kernel-doc comment is missing, and some functions had kernel-doc that used a hash instead of a colon to separate the parameter name from the one line description. Fix them to remove some warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/nilfs2/inode.c:73: warning: Function parameter or member 'inode' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'blkoff' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'bh_result' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'create' not described in 'nilfs_get_block' fs/nilfs2/inode.c:145: warning: Function parameter or member 'file' not described in 'nilfs_readpage' fs/nilfs2/inode.c:145: warning: Function parameter or member 'page' not described in 'nilfs_readpage' fs/nilfs2/inode.c:968: warning: Function parameter or member 'flags' not described in 'nilfs_dirty_inode' Link: https://lkml.kernel.org/r/20220324024215.63479-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/1652276316-7791-1-git-send-email-konishi.ryusuke@gmail.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff 516edb45 Thu May 12 08:05:29 MDT 2022 Yang Li <yang.lee@linux.alibaba.com> nilfs2: Fix some kernel-doc comments The description of @flags in nilfs_dirty_inode() kernel-doc comment is missing, and some functions had kernel-doc that used a hash instead of a colon to separate the parameter name from the one line description. Fix them to remove some warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/nilfs2/inode.c:73: warning: Function parameter or member 'inode' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'blkoff' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'bh_result' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'create' not described in 'nilfs_get_block' fs/nilfs2/inode.c:145: warning: Function parameter or member 'file' not described in 'nilfs_readpage' fs/nilfs2/inode.c:145: warning: Function parameter or member 'page' not described in 'nilfs_readpage' fs/nilfs2/inode.c:968: warning: Function parameter or member 'flags' not described in 'nilfs_dirty_inode' Link: https://lkml.kernel.org/r/20220324024215.63479-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/1652276316-7791-1-git-send-email-konishi.ryusuke@gmail.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff 516edb45 Thu May 12 08:05:29 MDT 2022 Yang Li <yang.lee@linux.alibaba.com> nilfs2: Fix some kernel-doc comments The description of @flags in nilfs_dirty_inode() kernel-doc comment is missing, and some functions had kernel-doc that used a hash instead of a colon to separate the parameter name from the one line description. Fix them to remove some warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/nilfs2/inode.c:73: warning: Function parameter or member 'inode' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'blkoff' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'bh_result' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'create' not described in 'nilfs_get_block' fs/nilfs2/inode.c:145: warning: Function parameter or member 'file' not described in 'nilfs_readpage' fs/nilfs2/inode.c:145: warning: Function parameter or member 'page' not described in 'nilfs_readpage' fs/nilfs2/inode.c:968: warning: Function parameter or member 'flags' not described in 'nilfs_dirty_inode' Link: https://lkml.kernel.org/r/20220324024215.63479-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/1652276316-7791-1-git-send-email-konishi.ryusuke@gmail.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff 516edb45 Thu May 12 08:05:29 MDT 2022 Yang Li <yang.lee@linux.alibaba.com> nilfs2: Fix some kernel-doc comments The description of @flags in nilfs_dirty_inode() kernel-doc comment is missing, and some functions had kernel-doc that used a hash instead of a colon to separate the parameter name from the one line description. Fix them to remove some warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/nilfs2/inode.c:73: warning: Function parameter or member 'inode' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'blkoff' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'bh_result' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'create' not described in 'nilfs_get_block' fs/nilfs2/inode.c:145: warning: Function parameter or member 'file' not described in 'nilfs_readpage' fs/nilfs2/inode.c:145: warning: Function parameter or member 'page' not described in 'nilfs_readpage' fs/nilfs2/inode.c:968: warning: Function parameter or member 'flags' not described in 'nilfs_dirty_inode' Link: https://lkml.kernel.org/r/20220324024215.63479-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/1652276316-7791-1-git-send-email-konishi.ryusuke@gmail.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff 516edb45 Thu May 12 08:05:29 MDT 2022 Yang Li <yang.lee@linux.alibaba.com> nilfs2: Fix some kernel-doc comments The description of @flags in nilfs_dirty_inode() kernel-doc comment is missing, and some functions had kernel-doc that used a hash instead of a colon to separate the parameter name from the one line description. Fix them to remove some warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/nilfs2/inode.c:73: warning: Function parameter or member 'inode' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'blkoff' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'bh_result' not described in 'nilfs_get_block' fs/nilfs2/inode.c:73: warning: Function parameter or member 'create' not described in 'nilfs_get_block' fs/nilfs2/inode.c:145: warning: Function parameter or member 'file' not described in 'nilfs_readpage' fs/nilfs2/inode.c:145: warning: Function parameter or member 'page' not described in 'nilfs_readpage' fs/nilfs2/inode.c:968: warning: Function parameter or member 'flags' not described in 'nilfs_dirty_inode' Link: https://lkml.kernel.org/r/20220324024215.63479-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/1652276316-7791-1-git-send-email-konishi.ryusuke@gmail.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes Patch series "nilfs2 lockdep warning fixes". The first two are to resolve the lockdep warning issue, and the last one is the accompanying cleanup and low priority. Based on your comment, this series solves the issue by separating inode object as needed. Since I was worried about the impact of the object composition changes, I tested the series carefully not to cause regressions especially for delicate functions such like disk space reclamation and snapshots. This patch (of 3): If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at inode_to_wb() during page/folio operations for btree nodes: WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline] WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 Modules linked in: ... RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline] RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline] RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509 ... Call Trace: __set_page_dirty include/linux/pagemap.h:834 [inline] mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145 nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline] nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009 nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline] nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036 nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline] nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is because nilfs2 uses two page caches for each inode and inode->i_mapping never points to one of them, the btree node cache. This causes inode_to_wb(inode) to refer to a different page cache than the caller page/folio operations such like __folio_start_writeback(), __folio_end_writeback(), or __folio_mark_dirty() acquired the lock. This patch resolves the issue by allocating and using an additional inode to hold the page cache of btree nodes. The inode is attached one-to-one to the traditional nilfs2 inode if it requires a block mapping with b-tree. This setup change is in memory only and does not affect the disk format. Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/fs/reiserfs/ | ||
H A D | inode.c | diff 71e7b535 Tue Jun 21 08:14:52 MDT 2022 Christian Brauner <brauner@kernel.org> quota: port quota helpers mount ids Port the is_quota_modification() and dqout_transfer() helper to type safe vfs{g,u}id_t. Since these helpers are only called by a few filesystems don't introduce a new helper but simply extend the existing helpers to pass down the mount's idmapping. Note, that this is a non-functional change, i.e. nothing will have happened here or at the end of this series to how quota are done! This a change necessary because we will at the end of this series make ownership changes easier to reason about by keeping the original value in struct iattr for both non-idmapped and idmapped mounts. For now we always pass the initial idmapping which makes the idmapping functions these helpers call nops. This is done because we currently always pass the actual value to be written to i_{g,u}id via struct iattr. While this allowed us to treat the {g,u}id values in struct iattr as values that can be directly written to inode->i_{g,u}id it also increases the potential for confusion for filesystems. Now that we are have dedicated types to prevent this confusion we will ultimately only map the value from the idmapped mount into a filesystem value that can be written to inode->i_{g,u}id when the filesystem actually updates the inode. So pass down the initial idmapping until we finished that conversion at which point we pass down the mount's idmapping. Since struct iattr uses an anonymous union with overlapping types as supported by the C standard, filesystems that haven't converted to ia_vfs{g,u}id won't see any difference and things will continue to work as before. In other words, no functional changes intended with this change. Link: https://lore.kernel.org/r/20220621141454.2914719-7-brauner@kernel.org Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> CC: linux-fsdevel@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Seth Forshee <sforshee@digitalocean.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> diff 6396bb22 Tue Jun 12 15:03:40 MDT 2018 Kees Cook <keescook@chromium.org> treewide: kzalloc() -> kcalloc() The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org> diff 6396bb22 Tue Jun 12 15:03:40 MDT 2018 Kees Cook <keescook@chromium.org> treewide: kzalloc() -> kcalloc() The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org> diff 6396bb22 Tue Jun 12 15:03:40 MDT 2018 Kees Cook <keescook@chromium.org> treewide: kzalloc() -> kcalloc() The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org> diff 1751e8a6 Mon Nov 27 14:05:09 MST 2017 Linus Torvalds <torvalds@linux-foundation.org> Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 1751e8a6 Mon Nov 27 14:05:09 MST 2017 Linus Torvalds <torvalds@linux-foundation.org> Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 1751e8a6 Mon Nov 27 14:05:09 MST 2017 Linus Torvalds <torvalds@linux-foundation.org> Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 1751e8a6 Mon Nov 27 14:05:09 MST 2017 Linus Torvalds <torvalds@linux-foundation.org> Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 1751e8a6 Mon Nov 27 14:05:09 MST 2017 Linus Torvalds <torvalds@linux-foundation.org> Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 714b71a3 Tue Feb 17 14:45:33 MST 2015 Fabian Frederick <fabf@skynet.be> fs/reiserfs/inode.c: replace 0 by NULL for pointers Fix sparse warning: fs/reiserfs/inode.c:2769:19: warning: Using plain integer as NULL pointer Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 714b71a3 Tue Feb 17 14:45:33 MST 2015 Fabian Frederick <fabf@skynet.be> fs/reiserfs/inode.c: replace 0 by NULL for pointers Fix sparse warning: fs/reiserfs/inode.c:2769:19: warning: Using plain integer as NULL pointer Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/fs/proc/ | ||
H A D | generic.c | diff 6dfbbae1 Fri Jan 21 23:14:20 MST 2022 Muchun Song <songmuchun@bytedance.com> fs: proc: store PDE()->data into inode->i_private PDE_DATA(inode) is introduced to get user private data and hide the layout of struct proc_dir_entry. The inode->i_private is used to do the same thing as well. Save a copy of user private data to inode-> i_private when proc inode is allocated. This means the user also can get their private data by inode->i_private. Introduce pde_data() to wrap inode->i_private so that we can remove PDE_DATA() from fs/proc/generic.c and make PTE_DATE() as a wrapper of pde_data(). It will be easier if we decide to remove PDE_DATE() in the future. Link: https://lkml.kernel.org/r/20211124081956.87711-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff f4bf74d8 Thu May 06 19:02:10 MDT 2021 Colin Ian King <colin.king@canonical.com> fs/proc/generic.c: fix incorrect pde_is_permanent check Currently the pde_is_permanent() check is being run on root multiple times rather than on the next proc directory entry. This looks like a copy-paste error. Fix this by replacing root with next. Addresses-Coverity: ("Copy-paste error") Link: https://lkml.kernel.org/r/20210318122633.14222-1-colin.king@canonical.com Fixes: d919b33dafb3 ("proc: faster open/read/close with "permanent" files") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Greg Kroah-Hartman <gregkh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff c6c75ded Tue Dec 15 21:42:39 MST 2020 Alexey Dobriyan <adobriyan@gmail.com> proc: fix lookup in /proc/net subdirectories after setns(2) Commit 1fde6f21d90f ("proc: fix /proc/net/* after setns(2)") only forced revalidation of regular files under /proc/net/ However, /proc/net/ is unusual in the sense of /proc/net/foo handlers take netns pointer from parent directory which is old netns. Steps to reproduce: (void)open("/proc/net/sctp/snmp", O_RDONLY); unshare(CLONE_NEWNET); int fd = open("/proc/net/sctp/snmp", O_RDONLY); read(fd, &c, 1); Read will read wrong data from original netns. Patch forces lookup on every directory under /proc/net . Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain Fixes: 1da4d377f943 ("proc: revalidate misc dentries") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d919b33d Mon Apr 06 21:09:01 MDT 2020 Alexey Dobriyan <adobriyan@gmail.com> proc: faster open/read/close with "permanent" files Now that "struct proc_ops" exist we can start putting there stuff which could not fly with VFS "struct file_operations"... Most of fs/proc/inode.c file is dedicated to make open/read/.../close reliable in the event of disappearing /proc entries which usually happens if module is getting removed. Files like /proc/cpuinfo which never disappear simply do not need such protection. Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such "permanent" files. Enable "permanent" flag for /proc/cpuinfo /proc/kmsg /proc/modules /proc/slabinfo /proc/stat /proc/sysvipc/* /proc/swaps More will come once I figure out foolproof way to prevent out module authors from marking their stuff "permanent" for performance reasons when it is not. This should help with scalability: benchmark is "read /proc/cpuinfo R times by N threads scattered over the system". N R t, s (before) t, s (after) ----------------------------------------------------- 64 4096 1.582458 1.530502 -3.2% 256 4096 6.371926 6.125168 -3.9% 1024 4096 25.64888 24.47528 -4.6% Benchmark source: #include <chrono> #include <iostream> #include <thread> #include <vector> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN); int N; const char *filename; int R; int xxx = 0; int glue(int n) { cpu_set_t m; CPU_ZERO(&m); CPU_SET(n, &m); return sched_setaffinity(0, sizeof(cpu_set_t), &m); } void f(int n) { glue(n % NR_CPUS); while (*(volatile int *)&xxx == 0) { } for (int i = 0; i < R; i++) { int fd = open(filename, O_RDONLY); char buf[4096]; ssize_t rv = read(fd, buf, sizeof(buf)); asm volatile ("" :: "g" (rv)); close(fd); } } int main(int argc, char *argv[]) { if (argc < 4) { std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R "; return 1; } N = atoi(argv[1]); filename = argv[2]; R = atoi(argv[3]); for (int i = 0; i < NR_CPUS; i++) { if (glue(i) == 0) break; } std::vector<std::thread> T; T.reserve(N); for (int i = 0; i < N; i++) { T.emplace_back(f, i); } auto t0 = std::chrono::system_clock::now(); { *(volatile int *)&xxx = 1; for (auto& t: T) { t.join(); } } auto t1 = std::chrono::system_clock::now(); std::chrono::duration<double> dt = t1 - t0; std::cout << dt.count() << ' '; return 0; } P.S.: Explicit randomization marker is added because adding non-function pointer will silently disable structure layout randomization. [akpm@linux-foundation.org: coding style fixes] Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 5f6354ea Wed Dec 04 17:50:02 MST 2019 Alexey Dobriyan <adobriyan@gmail.com> fs/proc/generic.c: delete useless "len" variable Pointer to next '/' encodes length of path element and next start position. Subtraction and increment are redundant. Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 61172eae Tue Apr 24 09:00:52 MDT 2018 Christoph Hellwig <hch@lst.de> proc: simplify proc_register calling conventions Return registered entry on success, return NULL on failure and free the passed in entry. Also expose it in internal.h as we'll start using it in proc_net.c soon. Signed-off-by: Christoph Hellwig <hch@lst.de> diff cde1b693 Mon Jul 10 16:50:52 MDT 2017 Heiner Kallweit <hkallweit1@gmail.com> fs/proc/generic.c: switch to ida_simple_get/remove The code can be much simplified by switching to ida_simple_get/remove. Link: http://lkml.kernel.org/r/8d1cc9f7-5115-c9dc-028e-c0770b6bfe1f@gmail.com Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff a528d35e Tue Jan 31 09:46:22 MST 2017 David Howells <dhowells@redhat.com> statx: Add a system call to make enhanced file info available Add a system call to make extended file information available, including file creation and some attribute flags where available through the underlying filesystem. The getattr inode operation is altered to take two additional arguments: a u32 request_mask and an unsigned int flags that indicate the synchronisation mode. This change is propagated to the vfs_getattr*() function. Functions like vfs_stat() are now inline wrappers around new functions vfs_statx() and vfs_statx_fd() to reduce stack usage. ======== OVERVIEW ======== The idea was initially proposed as a set of xattrs that could be retrieved with getxattr(), but the general preference proved to be for a new syscall with an extended stat structure. A number of requests were gathered for features to be included. The following have been included: (1) Make the fields a consistent size on all arches and make them large. (2) Spare space, request flags and information flags are provided for future expansion. (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an __s64). (4) Creation time: The SMB protocol carries the creation time, which could be exported by Samba, which will in turn help CIFS make use of FS-Cache as that can be used for coherency data (stx_btime). This is also specified in NFSv4 as a recommended attribute and could be exported by NFSD [Steve French]. (5) Lightweight stat: Ask for just those details of interest, and allow a netfs (such as NFS) to approximate anything not of interest, possibly without going to the server [Trond Myklebust, Ulrich Drepper, Andreas Dilger] (AT_STATX_DONT_SYNC). (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks its cached attributes are up to date [Trond Myklebust] (AT_STATX_FORCE_SYNC). And the following have been left out for future extension: (7) Data version number: Could be used by userspace NFS servers [Aneesh Kumar]. Can also be used to modify fill_post_wcc() in NFSD which retrieves i_version directly, but has just called vfs_getattr(). It could get it from the kstat struct if it used vfs_xgetattr() instead. (There's disagreement on the exact semantics of a single field, since not all filesystems do this the same way). (8) BSD stat compatibility: Including more fields from the BSD stat such as creation time (st_btime) and inode generation number (st_gen) [Jeremy Allison, Bernd Schubert]. (9) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd Schubert]. (This was asked for but later deemed unnecessary with the open-by-handle capability available and caused disagreement as to whether it's a security hole or not). (10) Extra coherency data may be useful in making backups [Andreas Dilger]. (No particular data were offered, but things like last backup timestamp, the data version number and the DOS archive bit would come into this category). (11) Allow the filesystem to indicate what it can/cannot provide: A filesystem can now say it doesn't support a standard stat feature if that isn't available, so if, for instance, inode numbers or UIDs don't exist or are fabricated locally... (This requires a separate system call - I have an fsinfo() call idea for this). (12) Store a 16-byte volume ID in the superblock that can be returned in struct xstat [Steve French]. (Deferred to fsinfo). (13) Include granularity fields in the time data to indicate the granularity of each of the times (NFSv4 time_delta) [Steve French]. (Deferred to fsinfo). (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags. Note that the Linux IOC flags are a mess and filesystems such as Ext4 define flags that aren't in linux/fs.h, so translation in the kernel may be a necessity (or, possibly, we provide the filesystem type too). (Some attributes are made available in stx_attributes, but the general feeling was that the IOC flags were to ext[234]-specific and shouldn't be exposed through statx this way). (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer, Michael Kerrisk]. (Deferred, probably to fsinfo. Finding out if there's an ACL or seclabal might require extra filesystem operations). (16) Femtosecond-resolution timestamps [Dave Chinner]. (A __reserved field has been left in the statx_timestamp struct for this - if there proves to be a need). (17) A set multiple attributes syscall to go with this. =============== NEW SYSTEM CALL =============== The new system call is: int ret = statx(int dfd, const char *filename, unsigned int flags, unsigned int mask, struct statx *buffer); The dfd, filename and flags parameters indicate the file to query, in a similar way to fstatat(). There is no equivalent of lstat() as that can be emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is also no equivalent of fstat() as that can be emulated by passing a NULL filename to statx() with the fd of interest in dfd. Whether or not statx() synchronises the attributes with the backing store can be controlled by OR'ing a value into the flags argument (this typically only affects network filesystems): (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this respect. (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise its attributes with the server - which might require data writeback to occur to get the timestamps correct. (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a network filesystem. The resulting values should be considered approximate. mask is a bitmask indicating the fields in struct statx that are of interest to the caller. The user should set this to STATX_BASIC_STATS to get the basic set returned by stat(). It should be noted that asking for more information may entail extra I/O operations. buffer points to the destination for the data. This must be 256 bytes in size. ====================== MAIN ATTRIBUTES RECORD ====================== The following structures are defined in which to return the main attribute set: struct statx_timestamp { __s64 tv_sec; __s32 tv_nsec; __s32 __reserved; }; struct statx { __u32 stx_mask; __u32 stx_blksize; __u64 stx_attributes; __u32 stx_nlink; __u32 stx_uid; __u32 stx_gid; __u16 stx_mode; __u16 __spare0[1]; __u64 stx_ino; __u64 stx_size; __u64 stx_blocks; __u64 __spare1[1]; struct statx_timestamp stx_atime; struct statx_timestamp stx_btime; struct statx_timestamp stx_ctime; struct statx_timestamp stx_mtime; __u32 stx_rdev_major; __u32 stx_rdev_minor; __u32 stx_dev_major; __u32 stx_dev_minor; __u64 __spare2[14]; }; The defined bits in request_mask and stx_mask are: STATX_TYPE Want/got stx_mode & S_IFMT STATX_MODE Want/got stx_mode & ~S_IFMT STATX_NLINK Want/got stx_nlink STATX_UID Want/got stx_uid STATX_GID Want/got stx_gid STATX_ATIME Want/got stx_atime{,_ns} STATX_MTIME Want/got stx_mtime{,_ns} STATX_CTIME Want/got stx_ctime{,_ns} STATX_INO Want/got stx_ino STATX_SIZE Want/got stx_size STATX_BLOCKS Want/got stx_blocks STATX_BASIC_STATS [The stuff in the normal stat struct] STATX_BTIME Want/got stx_btime{,_ns} STATX_ALL [All currently available stuff] stx_btime is the file creation time, stx_mask is a bitmask indicating the data provided and __spares*[] are where as-yet undefined fields can be placed. Time fields are structures with separate seconds and nanoseconds fields plus a reserved field in case we want to add even finer resolution. Note that times will be negative if before 1970; in such a case, the nanosecond fields will also be negative if not zero. The bits defined in the stx_attributes field convey information about a file, how it is accessed, where it is and what it does. The following attributes map to FS_*_FL flags and are the same numerical value: STATX_ATTR_COMPRESSED File is compressed by the fs STATX_ATTR_IMMUTABLE File is marked immutable STATX_ATTR_APPEND File is append-only STATX_ATTR_NODUMP File is not to be dumped STATX_ATTR_ENCRYPTED File requires key to decrypt in fs Within the kernel, the supported flags are listed by: KSTAT_ATTR_FS_IOC_FLAGS [Are any other IOC flags of sufficient general interest to be exposed through this interface?] New flags include: STATX_ATTR_AUTOMOUNT Object is an automount trigger These are for the use of GUI tools that might want to mark files specially, depending on what they are. Fields in struct statx come in a number of classes: (0) stx_dev_*, stx_blksize. These are local system information and are always available. (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino, stx_size, stx_blocks. These will be returned whether the caller asks for them or not. The corresponding bits in stx_mask will be set to indicate whether they actually have valid values. If the caller didn't ask for them, then they may be approximated. For example, NFS won't waste any time updating them from the server, unless as a byproduct of updating something requested. If the values don't actually exist for the underlying object (such as UID or GID on a DOS file), then the bit won't be set in the stx_mask, even if the caller asked for the value. In such a case, the returned value will be a fabrication. Note that there are instances where the type might not be valid, for instance Windows reparse points. (2) stx_rdev_*. This will be set only if stx_mode indicates we're looking at a blockdev or a chardev, otherwise will be 0. (3) stx_btime. Similar to (1), except this will be set to 0 if it doesn't exist. ======= TESTING ======= The following test program can be used to test the statx system call: samples/statx/test-statx.c Just compile and run, passing it paths to the files you want to examine. The file is built automatically if CONFIG_SAMPLES is enabled. Here's some example output. Firstly, an NFS directory that crosses to another FSID. Note that the AUTOMOUNT attribute is set because transiting this directory will cause d_automount to be invoked by the VFS. [root@andromeda ~]# /tmp/test-statx -A /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:26 Inode: 1703937 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------) Secondly, the result of automounting on that directory. [root@andromeda ~]# /tmp/test-statx /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:27 Inode: 2 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 2fc1e948 Wed Dec 10 16:45:07 MST 2014 Nicolas Dichtel <nicolas.dichtel@6wind.com> fs/proc.c: use rb_entry_safe() instead of rb_entry() Better to use existing macro that rewriting them. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff fd3930f7 Mon Aug 19 17:26:12 MDT 2013 Linus Torvalds <torvalds@linux-foundation.org> proc: more readdir conversion bug-fixes In the previous commit, Richard Genoud fixed proc_root_readdir(), which had lost the check for whether all of the non-process /proc entries had been returned or not. But that in turn exposed _another_ bug, namely that the original readdir conversion patch had yet another problem: it had lost the return value of proc_readdir_de(), so now checking whether it had completed successfully or not didn't actually work right anyway. This reinstates the non-zero return for the "end of base entries" that had also gotten lost in commit f0c3b5093add ("[readdir] convert procfs"). So now you get all the base entries *and* you get all the process entries, regardless of getdents buffer size. (Side note: the Linux "getdents" manual page actually has a nice example application for testing getdents, which can be easily modified to use different buffers. Who knew? Man-pages can be useful) Reported-by: Emmanuel Benisty <benisty.e@gmail.com> Reported-by: Marc Dionne <marc.c.dionne@gmail.com> Cc: Richard Genoud <richard.genoud@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
H A D | proc_sysctl.c | diff 8b793bcd Fri Oct 27 15:46:53 MDT 2023 Krister Johansen <kjlx@templeofstupid.com> watchdog: move softlockup_panic back to early_param Setting softlockup_panic from do_sysctl_args() causes it to take effect later in boot. The lockup detector is enabled before SMP is brought online, but do_sysctl_args runs afterwards. If a user wants to set softlockup_panic on boot and have it trigger should a softlockup occur during onlining of the non-boot processors, they could do this prior to commit f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases"). However, after this commit the value of softlockup_panic is set too late to be of help for this type of problem. Restore the prior behavior. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Cc: stable@vger.kernel.org Fixes: f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases") Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff 8b793bcd Fri Oct 27 15:46:53 MDT 2023 Krister Johansen <kjlx@templeofstupid.com> watchdog: move softlockup_panic back to early_param Setting softlockup_panic from do_sysctl_args() causes it to take effect later in boot. The lockup detector is enabled before SMP is brought online, but do_sysctl_args runs afterwards. If a user wants to set softlockup_panic on boot and have it trigger should a softlockup occur during onlining of the non-boot processors, they could do this prior to commit f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases"). However, after this commit the value of softlockup_panic is set too late to be of help for this type of problem. Restore the prior behavior. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Cc: stable@vger.kernel.org Fixes: f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases") Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff bff97cf1 Tue Aug 08 16:49:57 MDT 2023 Joel Granados <joel.granados@gmail.com> sysctl: Add a size arg to __register_sysctl_table We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we *do* make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size *does* however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff bff97cf1 Tue Aug 08 16:49:57 MDT 2023 Joel Granados <joel.granados@gmail.com> sysctl: Add a size arg to __register_sysctl_table We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we *do* make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size *does* however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff bff97cf1 Tue Aug 08 16:49:57 MDT 2023 Joel Granados <joel.granados@gmail.com> sysctl: Add a size arg to __register_sysctl_table We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we *do* make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size *does* however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff bff97cf1 Tue Aug 08 16:49:57 MDT 2023 Joel Granados <joel.granados@gmail.com> sysctl: Add a size arg to __register_sysctl_table We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we *do* make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size *does* however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff cc9f7ee0 Tue Aug 08 16:49:53 MDT 2023 Joel Granados <joel.granados@gmail.com> sysctl: Prefer ctl_table_header in proc_sysctl This is a preparation commit that replaces ctl_table with ctl_table_header as the pointer that is passed around in proc_sysctl.c. This will become necessary in subsequent commits when the size of the ctl_table array can no longer be calculated by searching for an empty sentinel (last empty ctl_table element) but will be carried along inside the ctl_table_header struct. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff 7fffbc71 Fri Jun 30 17:19:47 MDT 2023 Tom Rix <trix@redhat.com> sysctl: set variable sysctl_mount_point storage-class-specifier to static smatch reports fs/proc/proc_sysctl.c:32:18: warning: symbol 'sysctl_mount_point' was not declared. Should it be static? This variable is only used in its defining file, so it should be static. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff 2f2665c1 Fri Jun 16 02:59:22 MDT 2023 Joel Granados <j.granados@samsung.com> sysctl: replace child with an enumeration This is part of the effort to remove the empty element at the end of ctl_table structs. "child" was a deprecated elem in this struct and was being used to differentiate between two types of ctl_tables: "normal" and "permanently emtpy". What changed?: * Replace "child" with an enumeration that will have two values: the default (0) and the permanently empty (1). The latter is left at zero so when struct ctl_table is created with kzalloc or in a local context, it will have the zero value by default. We document the new enum with kdoc. * Remove the "empty child" check from sysctl_check_table * Remove count_subheaders function as there is no longer a need to calculate how many headers there are for every child * Remove the recursive call to unregister_sysctl_table as there is no need to traverse down the child tree any longer * Add a new SYSCTL_PERM_EMPTY_DIR binary flag * Remove the last remanence of child from partport/procfs.c Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff 19c4e618 Tue May 23 06:22:18 MDT 2023 Joel Granados <j.granados@samsung.com> sysctl: stop exporting register_sysctl_table We make register_sysctl_table static because the only function calling it is in fs/proc/proc_sysctl.c (__register_sysctl_base). We remove it from the sysctl.h header and modify the documentation in both the header and proc_sysctl.c files to mention "register_sysctl" instead of "register_sysctl_table". This plus the commits that remove register_sysctl_table from parport save 217 bytes: ./scripts/bloat-o-meter .bsysctl/vmlinux.old .bsysctl/vmlinux.new add/remove: 0/1 grow/shrink: 5/1 up/down: 458/-675 (-217) Function old new delta __register_sysctl_base 8 286 +278 parport_proc_register 268 379 +111 parport_device_proc_register 195 247 +52 kzalloc.constprop 598 608 +10 parport_default_proc_register 62 69 +7 register_sysctl_table 291 - -291 parport_sysctl_template 1288 904 -384 Total: Before=8603076, After=8602859, chg -0.00% Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff 19c4e618 Tue May 23 06:22:18 MDT 2023 Joel Granados <j.granados@samsung.com> sysctl: stop exporting register_sysctl_table We make register_sysctl_table static because the only function calling it is in fs/proc/proc_sysctl.c (__register_sysctl_base). We remove it from the sysctl.h header and modify the documentation in both the header and proc_sysctl.c files to mention "register_sysctl" instead of "register_sysctl_table". This plus the commits that remove register_sysctl_table from parport save 217 bytes: ./scripts/bloat-o-meter .bsysctl/vmlinux.old .bsysctl/vmlinux.new add/remove: 0/1 grow/shrink: 5/1 up/down: 458/-675 (-217) Function old new delta __register_sysctl_base 8 286 +278 parport_proc_register 268 379 +111 parport_device_proc_register 195 247 +52 kzalloc.constprop 598 608 +10 parport_default_proc_register 62 69 +7 register_sysctl_table 291 - -291 parport_sysctl_template 1288 904 -384 Total: Before=8603076, After=8602859, chg -0.00% Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> |
H A D | base.c | diff 63993102 Wed Oct 25 18:56:34 MDT 2023 Yang Li <yang.lee@linux.alibaba.com> fs/proc/base.c: remove unneeded semicolon ./fs/proc/base.c:3829:2-3: Unneeded semicolon Link: https://lkml.kernel.org/r/20231026005634.6581-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7057 Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 63993102 Wed Oct 25 18:56:34 MDT 2023 Yang Li <yang.lee@linux.alibaba.com> fs/proc/base.c: remove unneeded semicolon ./fs/proc/base.c:3829:2-3: Unneeded semicolon Link: https://lkml.kernel.org/r/20231026005634.6581-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7057 Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff da4d6b9c Mon Nov 08 19:32:05 MST 2021 Stephen Brennan <stephen.s.brennan@oracle.com> proc: allow pid_revalidate() during LOOKUP_RCU Problem Description: When running running ~128 parallel instances of TZ=/etc/localtime ps -fe >/dev/null on a 128CPU machine, the %sys utilization reaches 97%, and perf shows the following code path as being responsible for heavy contention on the d_lockref spinlock: walk_component() lookup_fast() d_revalidate() pid_revalidate() // returns -ECHILD unlazy_child() lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention The reason is that pid_revalidate() is triggering a drop from RCU to ref path walk mode. All concurrent path lookups thus try to grab a reference to the dentry for /proc/, before re-executing pid_revalidate() and then stepping into the /proc/$pid directory. Thus there is huge spinlock contention. This patch allows pid_revalidate() to execute in RCU mode, meaning that the path lookup can successfully enter the /proc/$pid directory while still in RCU mode. Later on, the path lookup may still drop into ref mode, but the contention will be much reduced at this point. By applying this patch, %sys utilization falls to around 85% under the same workload, and the number of ps processes executed per unit time increases by 3x-4x. Although this particular workload is a bit contrived, we have seen some large collections of eager monitoring scripts which produced similarly high %sys time due to contention in the /proc directory. As a result this patch, Al noted that several procfs methods which were only called in ref-walk mode could now be called from RCU mode. To ensure that this patch is safe, I audited all the inode get_link and permission() implementations, as well as dentry d_revalidate() implementations, in fs/proc. The purpose here is to ensure that they either are safe to call in RCU (i.e. don't sleep) or correctly bail out of RCU mode if they don't support it. My analysis shows that all at-risk procfs methods are safe to call under RCU, and thus this patch is safe. Procfs RCU-walk Analysis: This analysis is up-to-date with 5.15-rc3. When called under RCU mode, these functions have arguments as follows: * get_link() receives a NULL dentry pointer when called in RCU mode. * permission() receives MAY_NOT_BLOCK in the mode parameter when called from RCU. * d_revalidate() receives LOOKUP_RCU in flags. For the following functions, either they are trivially RCU safe, or they explicitly bail at the beginning of the function when they run: proc_ns_get_link (bails out) proc_get_link (RCU safe) proc_pid_get_link (bails out) map_files_d_revalidate (bails out) map_misc_d_revalidate (bails out) proc_net_d_revalidate (RCU safe) proc_sys_revalidate (bails out, also not under /proc/$pid) tid_fd_revalidate (bails out) proc_sys_permission (not under /proc/$pid) The remainder of the functions require a bit more detail: * proc_fd_permission: RCU safe. All of the body of this function is under rcu_read_lock(), except generic_permission() which declares itself RCU safe in its documentation string. * proc_self_get_link uses GFP_ATOMIC in the RCU case, so it is RCU aware and otherwise looks safe. The same is true of proc_thread_self_get_link. * proc_map_files_get_link: calls ns_capable, which calls capable(), and thus calls into the audit code (see note #1 below). The remainder is just a call to the trivially safe proc_pid_get_link(). * proc_pid_permission: calls ptrace_may_access(), which appears RCU safe, although it does call into the "security_ptrace_access_check()" hook, which looks safe under smack and selinux. Just the audit code is of concern. Also uses get_task_struct() and put_task_struct(), see note #2 below. * proc_tid_comm_permission: Appears safe, though calls put_task_struct (see note #2 below). Note #1: Most of the concern of RCU safety has centered around the audit code. However, since b17ec22fb339 ("selinux: slow_avc_audit has become non-blocking"), it's safe to call this code under RCU. So all of the above are safe by my estimation. Note #2: get_task_struct() and put_task_struct(): The majority of get_task_struct() is under RCU read lock, and in any case it is a simple increment. But put_task_struct() is complex, given that it could at some point free the task struct, and this process has many steps which I couldn't manually verify. However, several other places call put_task_struct() under RCU, so it appears safe to use here too (see kernel/hung_task.c:165 or rcu/tree-stall.h:296) Patch description: pid_revalidate() drops from RCU into REF lookup mode. When many threads are resolving paths within /proc in parallel, this can result in heavy spinlock contention on d_lockref as each thread tries to grab a reference to the /proc dentry (and drop it shortly thereafter). Investigation indicates that it is not necessary to drop RCU in pid_revalidate(), as no RCU data is modified and the function never sleeps. So, remove the LOOKUP_RCU check. Link: https://lkml.kernel.org/r/20211004175629.292270-2-stephen.s.brennan@oracle.com Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 63f818f4 Tue Apr 07 08:43:04 MDT 2020 Eric W. Biederman <ebiederm@xmission.com> proc: Use a dedicated lock in struct pid syzbot wrote: > ======================================================== > WARNING: possible irq lock inversion dependency detected > 5.6.0-syzkaller #0 Not tainted > -------------------------------------------------------- > swapper/1/0 just changed the state of lock: > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840 > but this lock took another, SOFTIRQ-unsafe lock in the past: > (&pid->wait_pidfd){+.+.}-{2:2} > > > and interrupts could create inverse lock ordering between them. > > > other info that might help us debug this: > Possible interrupt unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(&pid->wait_pidfd); > local_irq_disable(); > lock(tasklist_lock); > lock(&pid->wait_pidfd); > <Interrupt> > lock(tasklist_lock); > > *** DEADLOCK *** > > 4 locks held by swapper/1/0: The problem is that because wait_pidfd.lock is taken under the tasklist lock. It must always be taken with irqs disabled as tasklist_lock can be taken from interrupt context and if wait_pidfd.lock was already taken this would create a lock order inversion. Oleg suggested just disabling irqs where I have added extra calls to wait_pidfd.lock. That should be safe and I think the code will eventually do that. It was rightly pointed out by Christian that sharing the wait_pidfd.lock was a premature optimization. It is also true that my pre-merge window testing was insufficient. So remove the premature optimization and give struct pid a dedicated lock of it's own for struct pid things. I have verified that lockdep sees all 3 paths where we take the new pid->lock and lockdep does not complain. It is my current day dream that one day pid->lock can be used to guard the task lists as well and then the tasklist_lock won't need to be held to deliver signals. That will require taking pid->lock with irqs disabled. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/ Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> diff e79f15a4 Wed Jan 15 02:28:51 MST 2020 Chen Yu <yu.c.chen@intel.com> x86/resctrl: Add task resctrl information display Monitoring tools that want to find out which resctrl control and monitor groups a task belongs to must currently read the "tasks" file in every group until they locate the process ID. Add an additional file /proc/{pid}/cpu_resctrl_groups to provide this information: 1) res: mon: resctrl is not available. 2) res:/ mon: Task is part of the root resctrl control group, and it is not associated to any monitor group. 3) res:/ mon:mon0 Task is part of the root resctrl control group and monitor group mon0. 4) res:group0 mon: Task is part of resctrl control group group0, and it is not associated to any monitor group. 5) res:group0 mon:mon1 Task is part of resctrl control group group0 and monitor group mon1. Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Jinshi Chen <jinshi.chen@intel.com> Link: https://lkml.kernel.org/r/20200115092851.14761-1-yu.c.chen@intel.com diff e79f15a4 Wed Jan 15 02:28:51 MST 2020 Chen Yu <yu.c.chen@intel.com> x86/resctrl: Add task resctrl information display Monitoring tools that want to find out which resctrl control and monitor groups a task belongs to must currently read the "tasks" file in every group until they locate the process ID. Add an additional file /proc/{pid}/cpu_resctrl_groups to provide this information: 1) res: mon: resctrl is not available. 2) res:/ mon: Task is part of the root resctrl control group, and it is not associated to any monitor group. 3) res:/ mon:mon0 Task is part of the root resctrl control group and monitor group mon0. 4) res:group0 mon: Task is part of resctrl control group group0, and it is not associated to any monitor group. 5) res:group0 mon:mon1 Task is part of resctrl control group group0 and monitor group mon1. Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Jinshi Chen <jinshi.chen@intel.com> Link: https://lkml.kernel.org/r/20200115092851.14761-1-yu.c.chen@intel.com diff e79f15a4 Wed Jan 15 02:28:51 MST 2020 Chen Yu <yu.c.chen@intel.com> x86/resctrl: Add task resctrl information display Monitoring tools that want to find out which resctrl control and monitor groups a task belongs to must currently read the "tasks" file in every group until they locate the process ID. Add an additional file /proc/{pid}/cpu_resctrl_groups to provide this information: 1) res: mon: resctrl is not available. 2) res:/ mon: Task is part of the root resctrl control group, and it is not associated to any monitor group. 3) res:/ mon:mon0 Task is part of the root resctrl control group and monitor group mon0. 4) res:group0 mon: Task is part of resctrl control group group0, and it is not associated to any monitor group. 5) res:group0 mon:mon1 Task is part of resctrl control group group0 and monitor group mon1. Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Jinshi Chen <jinshi.chen@intel.com> Link: https://lkml.kernel.org/r/20200115092851.14761-1-yu.c.chen@intel.com diff d26d0cd9 Sat Jul 13 15:27:14 MDT 2019 Linus Torvalds <torvalds@linux-foundation.org> /proc/<pid>/cmdline: add back the setproctitle() special case This makes the setproctitle() special case very explicit indeed, and handles it with a separate helper function entirely. In the process, it re-instates the original semantics of simply stopping at the first NUL character when the original last NUL character is no longer there. [ The original semantics can still be seen in mm/util.c: get_cmdline() that is limited to a fixed-size buffer ] This makes the logic about when we use the string lengths etc much more obvious, and makes it easier to see what we do and what the two very different cases are. Note that even when we allow walking past the end of the argument array (because the setproctitle() might have overwritten and overflowed the original argv[] strings), we only allow it when it overflows into the environment region if it is immediately adjacent. [ Fixed for missing 'count' checks noted by Alexey Izbyshev ] Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/ Fixes: 5ab827189965 ("fs/proc: simplify and clarify get_mm_cmdline() function") Cc: Jakub Jankowski <shasta@toxcorp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ac311a14 Thu Jul 11 22:00:31 MDT 2019 Shakeel Butt <shakeelb@google.com> oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/fs/9p/ | ||
H A D | vfs_inode.c | diff cf7c33d3 Wed May 03 01:49:29 MDT 2023 Dominique Martinet <asmadeus@codewreck.org> 9p: remove dead stores (variable set again without being read) The 9p code for some reason used to initialize variables outside of the declaration, e.g. instead of just initializing the variable like this: int retval = 0 We would be doing this: int retval; retval = 0; This is perfectly fine and the compiler will just optimize dead stores anyway, but scan-build seems to think this is a problem and there are many of these warnings making the output of scan-build full of such warnings: fs/9p/vfs_inode.c:916:2: warning: Value stored to 'retval' is never read [deadcode.DeadStores] retval = 0; ^ ~ I have no strong opinion here, but if we want to regularly run scan-build we should fix these just to silence the messages. I've confirmed these all are indeed ok to remove. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org> diff 874c8ca1 Thu Jun 09 14:46:04 MDT 2022 David Howells <dhowells@redhat.com> netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context While randstruct was satisfied with using an open-coded "void *" offset cast for the netfs_i_context <-> inode casting, __builtin_object_size() as used by FORTIFY_SOURCE was not as easily fooled. This was causing the following complaint[1] from gcc v12: In file included from include/linux/string.h:253, from include/linux/ceph/ceph_debug.h:7, from fs/ceph/inode.c:2: In function 'fortify_memset_chk', inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2, inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2: include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning] 242 | __write_overflow_field(p_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fix this by embedding a struct inode into struct netfs_i_context (which should perhaps be renamed to struct netfs_inode). The struct inode vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode structs and vfs_inode is then simply changed to "netfs.inode" in those filesystems. Further, rename netfs_i_context to netfs_inode, get rid of the netfs_inode() function that converted a netfs_i_context pointer to an inode pointer (that can now be done with &ctx->inode) and rename the netfs_i_context() function to netfs_inode() (which is now a wrapper around container_of()). Most of the changes were done with: perl -p -i -e 's/vfs_inode/netfs.inode/'g \ `git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]` Kees suggested doing it with a pair structure[2] and a special declarator to insert that into the network filesystem's inode wrapper[3], but I think it's cleaner to embed it - and then it doesn't matter if struct randomisation reorders things. Dave Chinner suggested using a filesystem-specific VFS_I() function in each filesystem to convert that filesystem's own inode wrapper struct into the VFS inode struct[4]. Version #2: - Fix a couple of missed name changes due to a disabled cifs option. - Rename nfs_i_context to nfs_inode - Use "netfs" instead of "nic" as the member name in per-fs inode wrapper structs. [ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily disable '-Wattribute-warning' for now") that is no longer needed ] Fixes: bc899ee1c898 ("netfs: Add a netfs inode context") Reported-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> cc: Jonathan Corbet <corbet@lwn.net> cc: Eric Van Hensbergen <ericvh@gmail.com> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Steve French <smfrench@gmail.com> cc: William Kucharski <william.kucharski@oracle.com> cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> cc: Dave Chinner <david@fromorbit.com> cc: linux-doc@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: samba-technical@lists.samba.org cc: linux-fsdevel@vger.kernel.org cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1] Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2] Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3] Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4] Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 874c8ca1 Thu Jun 09 14:46:04 MDT 2022 David Howells <dhowells@redhat.com> netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context While randstruct was satisfied with using an open-coded "void *" offset cast for the netfs_i_context <-> inode casting, __builtin_object_size() as used by FORTIFY_SOURCE was not as easily fooled. This was causing the following complaint[1] from gcc v12: In file included from include/linux/string.h:253, from include/linux/ceph/ceph_debug.h:7, from fs/ceph/inode.c:2: In function 'fortify_memset_chk', inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2, inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2: include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning] 242 | __write_overflow_field(p_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fix this by embedding a struct inode into struct netfs_i_context (which should perhaps be renamed to struct netfs_inode). The struct inode vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode structs and vfs_inode is then simply changed to "netfs.inode" in those filesystems. Further, rename netfs_i_context to netfs_inode, get rid of the netfs_inode() function that converted a netfs_i_context pointer to an inode pointer (that can now be done with &ctx->inode) and rename the netfs_i_context() function to netfs_inode() (which is now a wrapper around container_of()). Most of the changes were done with: perl -p -i -e 's/vfs_inode/netfs.inode/'g \ `git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]` Kees suggested doing it with a pair structure[2] and a special declarator to insert that into the network filesystem's inode wrapper[3], but I think it's cleaner to embed it - and then it doesn't matter if struct randomisation reorders things. Dave Chinner suggested using a filesystem-specific VFS_I() function in each filesystem to convert that filesystem's own inode wrapper struct into the VFS inode struct[4]. Version #2: - Fix a couple of missed name changes due to a disabled cifs option. - Rename nfs_i_context to nfs_inode - Use "netfs" instead of "nic" as the member name in per-fs inode wrapper structs. [ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily disable '-Wattribute-warning' for now") that is no longer needed ] Fixes: bc899ee1c898 ("netfs: Add a netfs inode context") Reported-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> cc: Jonathan Corbet <corbet@lwn.net> cc: Eric Van Hensbergen <ericvh@gmail.com> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Steve French <smfrench@gmail.com> cc: William Kucharski <william.kucharski@oracle.com> cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> cc: Dave Chinner <david@fromorbit.com> cc: linux-doc@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: samba-technical@lists.samba.org cc: linux-fsdevel@vger.kernel.org cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1] Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2] Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3] Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4] Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 6e195b0f Tue Nov 02 07:16:43 MDT 2021 Dominique Martinet <asmadeus@codewreck.org> 9p: fix a bunch of checkpatch warnings Sohaib Mohamed started a serie of tiny and incomplete checkpatch fixes but seemingly stopped halfway -- take over and do most of it. This is still missing net/9p/trans* and net/9p/protocol.c for a later time... Link: http://lkml.kernel.org/r/20211102134608.1588018-3-dominique.martinet@atmark-techno.com Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> diff 2ed0b757 Wed Jul 29 02:59:00 MDT 2020 Li Heng <liheng40@huawei.com> 9p: Remove unneeded cast from memory allocation Remove kmem_cache_alloc return value cast. Coccinelle emits the following warning: ./fs/9p/vfs_inode.c:226:12-29: WARNING: casting value returned by memory allocation function to (struct v9fs_inode *) is useless. Link: http://lkml.kernel.org/r/1596013140-49744-1-git-send-email-liheng40@huawei.com Signed-off-by: Li Heng <liheng40@huawei.com> [Dominique: commit message wording] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> diff 51958817 Thu Dec 19 03:48:37 MST 2019 zhengbin <zhengbin13@huawei.com> 9p: Remove unneeded semicolon Fixes coccicheck warning: fs/9p/vfs_inode.c:146:3-4: Unneeded semicolon Link: http://lkml.kernel.org/r/1576752517-58292-1-git-send-email-zhengbin13@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> diff ac89b2ef Thu Apr 05 17:19:49 MDT 2018 Eryu Guan <eguan@linux.alibaba.com> 9p: don't maintain dir i_nlink if the exported fs doesn't either If the exported filesystem dir on 9p server doesn't maintain accurate i_nlink count, e.g. always reports i_nlink as 1, then 9p should not maintain nlink count either, otherwise drop_link would report warning with i_nlink being zero. For example: - overlayfs sets nlink to 1 for merged dir - ext4 (with dir_nlink feature enabled) sets nlink to 1 if a dir has more than EXT4_LINK_MAX (65000) links. In this case, everytime a stat(2) call (getattr) on such exported dirs on 9p client side, the i_nlink gets reset to 1, then operations like rmdir(2), unlink(2) and rename(2) would cause the dir nlink to go to zero (then negative), which results in warnings in drop_nlink() and/or inc_nlink() calls. This can be reproduced easily as the following steps: - export a merged overlayfs dir via qemu virtfs to guest - mount the exported virtfs in guest - create two sub-directories in the root dir of the mounted 9pfs - stat the root dir of 9pfs, this resets nlink to 1 - remove all subdirs, the second unlink/rmdir would trigger warning ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1284 at fs/inode.c:282 drop_nlink+0x3e/0x50 ... Call Trace: dump_stack+0x63/0x81 __warn+0xcb/0xf0 warn_slowpath_null+0x1d/0x20 drop_nlink+0x3e/0x50 v9fs_remove+0xaa/0x130 [9p] v9fs_vfs_rmdir+0x13/0x20 [9p] vfs_rmdir+0xb7/0x130 do_rmdir+0x1b8/0x230 SyS_unlinkat+0x22/0x30 do_syscall_64+0x67/0x180 ---[ end trace 43758d8ba91e603b ]--- Fix it by leaving i_nlink to be 1 and don't drop nlink if a directory has nlink <= 2, which indicates that the underlying exported fs doesn't maintain nlink count accurately. This follows what ext4 does in ext4_dec_count(). Link: http://lkml.kernel.org/r/20180312053829.4367-1-eguan@linux.alibaba.com Signed-off-by: Eryu Guan <eguan@linux.alibaba.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Tested-by: Roman Kapl <code@rkapl.cz> Cc: Caspar Zhang <caspar@linux.alibaba.com> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: <v9fs-developer@lists.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff a528d35e Tue Jan 31 09:46:22 MST 2017 David Howells <dhowells@redhat.com> statx: Add a system call to make enhanced file info available Add a system call to make extended file information available, including file creation and some attribute flags where available through the underlying filesystem. The getattr inode operation is altered to take two additional arguments: a u32 request_mask and an unsigned int flags that indicate the synchronisation mode. This change is propagated to the vfs_getattr*() function. Functions like vfs_stat() are now inline wrappers around new functions vfs_statx() and vfs_statx_fd() to reduce stack usage. ======== OVERVIEW ======== The idea was initially proposed as a set of xattrs that could be retrieved with getxattr(), but the general preference proved to be for a new syscall with an extended stat structure. A number of requests were gathered for features to be included. The following have been included: (1) Make the fields a consistent size on all arches and make them large. (2) Spare space, request flags and information flags are provided for future expansion. (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an __s64). (4) Creation time: The SMB protocol carries the creation time, which could be exported by Samba, which will in turn help CIFS make use of FS-Cache as that can be used for coherency data (stx_btime). This is also specified in NFSv4 as a recommended attribute and could be exported by NFSD [Steve French]. (5) Lightweight stat: Ask for just those details of interest, and allow a netfs (such as NFS) to approximate anything not of interest, possibly without going to the server [Trond Myklebust, Ulrich Drepper, Andreas Dilger] (AT_STATX_DONT_SYNC). (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks its cached attributes are up to date [Trond Myklebust] (AT_STATX_FORCE_SYNC). And the following have been left out for future extension: (7) Data version number: Could be used by userspace NFS servers [Aneesh Kumar]. Can also be used to modify fill_post_wcc() in NFSD which retrieves i_version directly, but has just called vfs_getattr(). It could get it from the kstat struct if it used vfs_xgetattr() instead. (There's disagreement on the exact semantics of a single field, since not all filesystems do this the same way). (8) BSD stat compatibility: Including more fields from the BSD stat such as creation time (st_btime) and inode generation number (st_gen) [Jeremy Allison, Bernd Schubert]. (9) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd Schubert]. (This was asked for but later deemed unnecessary with the open-by-handle capability available and caused disagreement as to whether it's a security hole or not). (10) Extra coherency data may be useful in making backups [Andreas Dilger]. (No particular data were offered, but things like last backup timestamp, the data version number and the DOS archive bit would come into this category). (11) Allow the filesystem to indicate what it can/cannot provide: A filesystem can now say it doesn't support a standard stat feature if that isn't available, so if, for instance, inode numbers or UIDs don't exist or are fabricated locally... (This requires a separate system call - I have an fsinfo() call idea for this). (12) Store a 16-byte volume ID in the superblock that can be returned in struct xstat [Steve French]. (Deferred to fsinfo). (13) Include granularity fields in the time data to indicate the granularity of each of the times (NFSv4 time_delta) [Steve French]. (Deferred to fsinfo). (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags. Note that the Linux IOC flags are a mess and filesystems such as Ext4 define flags that aren't in linux/fs.h, so translation in the kernel may be a necessity (or, possibly, we provide the filesystem type too). (Some attributes are made available in stx_attributes, but the general feeling was that the IOC flags were to ext[234]-specific and shouldn't be exposed through statx this way). (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer, Michael Kerrisk]. (Deferred, probably to fsinfo. Finding out if there's an ACL or seclabal might require extra filesystem operations). (16) Femtosecond-resolution timestamps [Dave Chinner]. (A __reserved field has been left in the statx_timestamp struct for this - if there proves to be a need). (17) A set multiple attributes syscall to go with this. =============== NEW SYSTEM CALL =============== The new system call is: int ret = statx(int dfd, const char *filename, unsigned int flags, unsigned int mask, struct statx *buffer); The dfd, filename and flags parameters indicate the file to query, in a similar way to fstatat(). There is no equivalent of lstat() as that can be emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is also no equivalent of fstat() as that can be emulated by passing a NULL filename to statx() with the fd of interest in dfd. Whether or not statx() synchronises the attributes with the backing store can be controlled by OR'ing a value into the flags argument (this typically only affects network filesystems): (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this respect. (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise its attributes with the server - which might require data writeback to occur to get the timestamps correct. (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a network filesystem. The resulting values should be considered approximate. mask is a bitmask indicating the fields in struct statx that are of interest to the caller. The user should set this to STATX_BASIC_STATS to get the basic set returned by stat(). It should be noted that asking for more information may entail extra I/O operations. buffer points to the destination for the data. This must be 256 bytes in size. ====================== MAIN ATTRIBUTES RECORD ====================== The following structures are defined in which to return the main attribute set: struct statx_timestamp { __s64 tv_sec; __s32 tv_nsec; __s32 __reserved; }; struct statx { __u32 stx_mask; __u32 stx_blksize; __u64 stx_attributes; __u32 stx_nlink; __u32 stx_uid; __u32 stx_gid; __u16 stx_mode; __u16 __spare0[1]; __u64 stx_ino; __u64 stx_size; __u64 stx_blocks; __u64 __spare1[1]; struct statx_timestamp stx_atime; struct statx_timestamp stx_btime; struct statx_timestamp stx_ctime; struct statx_timestamp stx_mtime; __u32 stx_rdev_major; __u32 stx_rdev_minor; __u32 stx_dev_major; __u32 stx_dev_minor; __u64 __spare2[14]; }; The defined bits in request_mask and stx_mask are: STATX_TYPE Want/got stx_mode & S_IFMT STATX_MODE Want/got stx_mode & ~S_IFMT STATX_NLINK Want/got stx_nlink STATX_UID Want/got stx_uid STATX_GID Want/got stx_gid STATX_ATIME Want/got stx_atime{,_ns} STATX_MTIME Want/got stx_mtime{,_ns} STATX_CTIME Want/got stx_ctime{,_ns} STATX_INO Want/got stx_ino STATX_SIZE Want/got stx_size STATX_BLOCKS Want/got stx_blocks STATX_BASIC_STATS [The stuff in the normal stat struct] STATX_BTIME Want/got stx_btime{,_ns} STATX_ALL [All currently available stuff] stx_btime is the file creation time, stx_mask is a bitmask indicating the data provided and __spares*[] are where as-yet undefined fields can be placed. Time fields are structures with separate seconds and nanoseconds fields plus a reserved field in case we want to add even finer resolution. Note that times will be negative if before 1970; in such a case, the nanosecond fields will also be negative if not zero. The bits defined in the stx_attributes field convey information about a file, how it is accessed, where it is and what it does. The following attributes map to FS_*_FL flags and are the same numerical value: STATX_ATTR_COMPRESSED File is compressed by the fs STATX_ATTR_IMMUTABLE File is marked immutable STATX_ATTR_APPEND File is append-only STATX_ATTR_NODUMP File is not to be dumped STATX_ATTR_ENCRYPTED File requires key to decrypt in fs Within the kernel, the supported flags are listed by: KSTAT_ATTR_FS_IOC_FLAGS [Are any other IOC flags of sufficient general interest to be exposed through this interface?] New flags include: STATX_ATTR_AUTOMOUNT Object is an automount trigger These are for the use of GUI tools that might want to mark files specially, depending on what they are. Fields in struct statx come in a number of classes: (0) stx_dev_*, stx_blksize. These are local system information and are always available. (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino, stx_size, stx_blocks. These will be returned whether the caller asks for them or not. The corresponding bits in stx_mask will be set to indicate whether they actually have valid values. If the caller didn't ask for them, then they may be approximated. For example, NFS won't waste any time updating them from the server, unless as a byproduct of updating something requested. If the values don't actually exist for the underlying object (such as UID or GID on a DOS file), then the bit won't be set in the stx_mask, even if the caller asked for the value. In such a case, the returned value will be a fabrication. Note that there are instances where the type might not be valid, for instance Windows reparse points. (2) stx_rdev_*. This will be set only if stx_mode indicates we're looking at a blockdev or a chardev, otherwise will be 0. (3) stx_btime. Similar to (1), except this will be set to 0 if it doesn't exist. ======= TESTING ======= The following test program can be used to test the statx system call: samples/statx/test-statx.c Just compile and run, passing it paths to the files you want to examine. The file is built automatically if CONFIG_SAMPLES is enabled. Here's some example output. Firstly, an NFS directory that crosses to another FSID. Note that the AUTOMOUNT attribute is set because transiting this directory will cause d_automount to be invoked by the VFS. [root@andromeda ~]# /tmp/test-statx -A /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:26 Inode: 1703937 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------) Secondly, the result of automounting on that directory. [root@andromeda ~]# /tmp/test-statx /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:27 Inode: 2 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 447c5094 Tue Jan 29 17:18:50 MST 2013 Eric W. Biederman <ebiederm@xmission.com> 9p: Modify the stat structures to use kuid_t and kgid_t 9p has thre strucrtures that can encode inode stat information. Modify all of those structures to contain kuid_t and kgid_t values. Modify he wire encoders and decoders of those structures to use 'u' and 'g' instead of 'd' in the format string where uids and gids are present. This results in all kuid and kgid conversion to and from on the wire values being performed by the same code in protocol.c where the client is known at the time of the conversion. Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> diff 30d90494 Thu Jun 21 14:40:19 MDT 2012 Al Viro <viro@zeniv.linux.org.uk> kill struct opendata Just pass struct file *. Methods are happier that way... There's no need to return struct file * from finish_open() now, so let it return int. Next: saner prototypes for parts in namei.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
/linux-master/fs/ocfs2/ | ||
H A D | file.c | diff ed5a7047 Mon Oct 17 09:06:37 MDT 2022 Christian Brauner <brauner@kernel.org> attr: use consistent sgid stripping checks Currently setgid stripping in file_remove_privs()'s should_remove_suid() helper is inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() // TAKE ON MOUNTER'S CREDS -> ovl_do_notify_change() -> notify_change() // GIVE UP MOUNTER'S CREDS // TAKE ON MOUNTER'S CREDS -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper to perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID. While we're at it we move should_remove_suid() from inode.c to attr.c where it belongs with the rest of the iattr helpers. Especially since it returns ATTR_KILL_S{G,U}ID flags. We also rename it to setattr_should_drop_suidgid() to better reflect that it indicates both setuid and setgid bit removal and also that it returns attr flags. Running xfstests with this doesn't report any regressions. We should really try and use consistent checks. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> diff ed5a7047 Mon Oct 17 09:06:37 MDT 2022 Christian Brauner <brauner@kernel.org> attr: use consistent sgid stripping checks Currently setgid stripping in file_remove_privs()'s should_remove_suid() helper is inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() // TAKE ON MOUNTER'S CREDS -> ovl_do_notify_change() -> notify_change() // GIVE UP MOUNTER'S CREDS // TAKE ON MOUNTER'S CREDS -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper to perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID. While we're at it we move should_remove_suid() from inode.c to attr.c where it belongs with the rest of the iattr helpers. Especially since it returns ATTR_KILL_S{G,U}ID flags. We also rename it to setattr_should_drop_suidgid() to better reflect that it indicates both setuid and setgid bit removal and also that it returns attr flags. Running xfstests with this doesn't report any regressions. We should really try and use consistent checks. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> diff 71e7b535 Tue Jun 21 08:14:52 MDT 2022 Christian Brauner <brauner@kernel.org> quota: port quota helpers mount ids Port the is_quota_modification() and dqout_transfer() helper to type safe vfs{g,u}id_t. Since these helpers are only called by a few filesystems don't introduce a new helper but simply extend the existing helpers to pass down the mount's idmapping. Note, that this is a non-functional change, i.e. nothing will have happened here or at the end of this series to how quota are done! This a change necessary because we will at the end of this series make ownership changes easier to reason about by keeping the original value in struct iattr for both non-idmapped and idmapped mounts. For now we always pass the initial idmapping which makes the idmapping functions these helpers call nops. This is done because we currently always pass the actual value to be written to i_{g,u}id via struct iattr. While this allowed us to treat the {g,u}id values in struct iattr as values that can be directly written to inode->i_{g,u}id it also increases the potential for confusion for filesystems. Now that we are have dedicated types to prevent this confusion we will ultimately only map the value from the idmapped mount into a filesystem value that can be written to inode->i_{g,u}id when the filesystem actually updates the inode. So pass down the initial idmapping until we finished that conversion at which point we pass down the mount's idmapping. Since struct iattr uses an anonymous union with overlapping types as supported by the C standard, filesystems that haven't converted to ia_vfs{g,u}id won't see any difference and things will continue to work as before. In other words, no functional changes intended with this change. Link: https://lore.kernel.org/r/20220621141454.2914719-7-brauner@kernel.org Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> CC: linux-fsdevel@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Seth Forshee <sforshee@digitalocean.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> diff 9449ad33 Thu Jul 29 15:53:41 MDT 2021 Junxiao Bi <junxiao.bi@oracle.com> ocfs2: issue zeroout to EOF blocks For punch holes in EOF blocks, fallocate used buffer write to zero the EOF blocks in last cluster. But since ->writepage will ignore EOF pages, those zeros will not be flushed. This "looks" ok as commit 6bba4471f0cc ("ocfs2: fix data corruption by fallocate") will zero the EOF blocks when extend the file size, but it isn't. The problem happened on those EOF pages, before writeback, those pages had DIRTY flag set and all buffer_head in them also had DIRTY flag set, when writeback run by write_cache_pages(), DIRTY flag on the page was cleared, but DIRTY flag on the buffer_head not. When next write happened to those EOF pages, since buffer_head already had DIRTY flag set, it would not mark page DIRTY again. That made writeback ignore them forever. That will cause data corruption. Even directio write can't work because it will fail when trying to drop pages caches before direct io, as it found the buffer_head for those pages still had DIRTY flag set, then it will fall back to buffer io mode. To make a summary of the issue, as writeback ingores EOF pages, once any EOF page is generated, any write to it will only go to the page cache, it will never be flushed to disk even file size extends and that page is not EOF page any more. The fix is to avoid zero EOF blocks with buffer write. The following code snippet from qemu-img could trigger the corruption. 656 open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11 ... 660 fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...> 660 fallocate(11, 0, 2275868672, 327680) = 0 658 pwrite64(11, " Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 2d797e9f Mon Feb 03 18:33:45 MST 2020 Gang He <GHe@suse.com> ocfs2: fix oops when writing cloned file Writing a cloned file triggers a kernel oops and the user-space command process is also killed by the system. The bug can be reproduced stably via: 1) create a file under ocfs2 file system directory. journalctl -b > aa.txt 2) create a cloned file for this file. reflink aa.txt bb.txt 3) write the cloned file with dd command. dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc The dd command is killed by the kernel, then you can see the oops message via dmesg command. [ 463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028 [ 463.875413] #PF: supervisor read access in kernel mode [ 463.875416] #PF: error_code(0x0000) - not-present page [ 463.875418] PGD 0 P4D 0 [ 463.875425] Oops: 0000 [#1] SMP PTI [ 463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G OE 5.3.16-2-default [ 463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2] [ 463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28 [ 463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202 [ 463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0 [ 463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000 [ 463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000 [ 463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048 [ 463.875526] FS: 00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000 [ 463.875529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0 [ 463.875546] Call Trace: [ 463.875596] ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2] [ 463.875648] ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2] [ 463.875672] new_sync_write+0x12d/0x1d0 [ 463.875688] vfs_write+0xad/0x1a0 [ 463.875697] ksys_write+0xa1/0xe0 [ 463.875710] do_syscall_64+0x60/0x1f0 [ 463.875743] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 463.875758] RIP: 0033:0x7f8f4482ed44 [ 463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 [ 463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44 [ 463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001 [ 463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003 [ 463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000 [ 463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000 This regression problem was introduced by commit e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com Fixes: e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 2d797e9f Mon Feb 03 18:33:45 MST 2020 Gang He <GHe@suse.com> ocfs2: fix oops when writing cloned file Writing a cloned file triggers a kernel oops and the user-space command process is also killed by the system. The bug can be reproduced stably via: 1) create a file under ocfs2 file system directory. journalctl -b > aa.txt 2) create a cloned file for this file. reflink aa.txt bb.txt 3) write the cloned file with dd command. dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc The dd command is killed by the kernel, then you can see the oops message via dmesg command. [ 463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028 [ 463.875413] #PF: supervisor read access in kernel mode [ 463.875416] #PF: error_code(0x0000) - not-present page [ 463.875418] PGD 0 P4D 0 [ 463.875425] Oops: 0000 [#1] SMP PTI [ 463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G OE 5.3.16-2-default [ 463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2] [ 463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28 [ 463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202 [ 463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0 [ 463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000 [ 463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000 [ 463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048 [ 463.875526] FS: 00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000 [ 463.875529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0 [ 463.875546] Call Trace: [ 463.875596] ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2] [ 463.875648] ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2] [ 463.875672] new_sync_write+0x12d/0x1d0 [ 463.875688] vfs_write+0xad/0x1a0 [ 463.875697] ksys_write+0xa1/0xe0 [ 463.875710] do_syscall_64+0x60/0x1f0 [ 463.875743] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 463.875758] RIP: 0033:0x7f8f4482ed44 [ 463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 [ 463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44 [ 463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001 [ 463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003 [ 463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000 [ 463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000 This regression problem was introduced by commit e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com Fixes: e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff e74540b2 Tue Nov 05 22:16:34 MST 2019 Shuning Zhang <sunny.s.zhang@oracle.com> ocfs2: protect extent tree in ocfs2_prepare_inode_for_write() When the extent tree is modified, it should be protected by inode cluster lock and ip_alloc_sem. The extent tree is accessed and modified in the ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem. The following is a case. The function ocfs2_fiemap is accessing the extent tree, which is modified at the same time. kernel BUG at fs/ocfs2/extent_map.c:475! invalid opcode: 0000 [#1] SMP Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...] CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2 Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018 task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000 RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2] Call Trace: ocfs2_fiemap+0x1e3/0x430 [ocfs2] do_vfs_ioctl+0x155/0x510 SyS_ioctl+0x81/0xa0 system_call_fastpath+0x18/0xd8 Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f RIP ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2] ---[ end trace c8aa0c8180e869dc ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled This issue can be reproduced every week in a production environment. This issue is related to the usage mode. If others use ocfs2 in this mode, the kernel will panic frequently. [akpm@linux-foundation.org: coding style fixes] [Fix new warning due to unused function by removing said function - Linus ] Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 236dcc2a Mon Sep 23 16:33:27 MDT 2019 zhengbin <zhengbin13@huawei.com> fs/ocfs2/file.c: remove set but not used variables Fixes gcc '-Wunused-but-set-variable' warning: fs/ocfs2/file.c: In function ocfs2_prepare_inode_for_write: fs/ocfs2/file.c:2143:9: warning: variable end set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/1566522588-63786-3-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: Hulk Robot <hulkci@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Changwei Ge <chge@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 236dcc2a Mon Sep 23 16:33:27 MDT 2019 zhengbin <zhengbin13@huawei.com> fs/ocfs2/file.c: remove set but not used variables Fixes gcc '-Wunused-but-set-variable' warning: fs/ocfs2/file.c: In function ocfs2_prepare_inode_for_write: fs/ocfs2/file.c:2143:9: warning: variable end set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/1566522588-63786-3-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: Hulk Robot <hulkci@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Changwei Ge <chge@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 236dcc2a Mon Sep 23 16:33:27 MDT 2019 zhengbin <zhengbin13@huawei.com> fs/ocfs2/file.c: remove set but not used variables Fixes gcc '-Wunused-but-set-variable' warning: fs/ocfs2/file.c: In function ocfs2_prepare_inode_for_write: fs/ocfs2/file.c:2143:9: warning: variable end set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/1566522588-63786-3-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: Hulk Robot <hulkci@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Changwei Ge <chge@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/fs/hugetlbfs/ | ||
H A D | inode.c | diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super When configuring a hugetlb filesystem via the fsconfig() syscall, there is a possible NULL dereference in hugetlbfs_fill_super() caused by assigning NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize is non valid. E.g: Taking the following steps: fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC); fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0); fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced with NULL, losing its previous value, and we will print an error: ... ... case Opt_pagesize: ps = memparse(param->string, &rest); ctx->hstate = h; if (!ctx->hstate) { pr_err("Unsupported page size %lu MB\n", ps / SZ_1M); return -EINVAL; } return 0; ... ... This is a problem because later on, we will dereference ctxt->hstate in hugetlbfs_fill_super() ... ... sb->s_blocksize = huge_page_size(ctx->hstate); ... ... Causing below Oops. Fix this by replacing cxt->hstate value only when then pagesize is known to be valid. kernel: hugetlbfs: Unsupported page size 0 MB kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028 kernel: #PF: supervisor read access in kernel mode kernel: #PF: error_code(0x0000) - not-present page kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0 kernel: Oops: 0000 [#1] PREEMPT SMP PTI kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017 kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 kernel: Call Trace: kernel: <TASK> kernel: ? __die_body+0x1a/0x60 kernel: ? page_fault_oops+0x16f/0x4a0 kernel: ? search_bpf_extables+0x65/0x70 kernel: ? fixup_exception+0x22/0x310 kernel: ? exc_page_fault+0x69/0x150 kernel: ? asm_exc_page_fault+0x22/0x30 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: ? hugetlbfs_fill_super+0xb4/0x1a0 kernel: ? hugetlbfs_fill_super+0x28/0x1a0 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: vfs_get_super+0x40/0xa0 kernel: ? __pfx_bpf_lsm_capable+0x10/0x10 kernel: vfs_get_tree+0x25/0xd0 kernel: vfs_cmd_create+0x64/0xe0 kernel: __x64_sys_fsconfig+0x395/0x410 kernel: do_syscall_64+0x80/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? exc_page_fault+0x69/0x150 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76 kernel: RIP: 0033:0x7ffbc0cb87c9 kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48 kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000 kernel: </TASK> kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E) kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E) kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1 kernel: CR2: 0000000000000028 kernel: ---[ end trace 0000000000000000 ]--- kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context") Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super When configuring a hugetlb filesystem via the fsconfig() syscall, there is a possible NULL dereference in hugetlbfs_fill_super() caused by assigning NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize is non valid. E.g: Taking the following steps: fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC); fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0); fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced with NULL, losing its previous value, and we will print an error: ... ... case Opt_pagesize: ps = memparse(param->string, &rest); ctx->hstate = h; if (!ctx->hstate) { pr_err("Unsupported page size %lu MB\n", ps / SZ_1M); return -EINVAL; } return 0; ... ... This is a problem because later on, we will dereference ctxt->hstate in hugetlbfs_fill_super() ... ... sb->s_blocksize = huge_page_size(ctx->hstate); ... ... Causing below Oops. Fix this by replacing cxt->hstate value only when then pagesize is known to be valid. kernel: hugetlbfs: Unsupported page size 0 MB kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028 kernel: #PF: supervisor read access in kernel mode kernel: #PF: error_code(0x0000) - not-present page kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0 kernel: Oops: 0000 [#1] PREEMPT SMP PTI kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017 kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 kernel: Call Trace: kernel: <TASK> kernel: ? __die_body+0x1a/0x60 kernel: ? page_fault_oops+0x16f/0x4a0 kernel: ? search_bpf_extables+0x65/0x70 kernel: ? fixup_exception+0x22/0x310 kernel: ? exc_page_fault+0x69/0x150 kernel: ? asm_exc_page_fault+0x22/0x30 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: ? hugetlbfs_fill_super+0xb4/0x1a0 kernel: ? hugetlbfs_fill_super+0x28/0x1a0 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: vfs_get_super+0x40/0xa0 kernel: ? __pfx_bpf_lsm_capable+0x10/0x10 kernel: vfs_get_tree+0x25/0xd0 kernel: vfs_cmd_create+0x64/0xe0 kernel: __x64_sys_fsconfig+0x395/0x410 kernel: do_syscall_64+0x80/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? exc_page_fault+0x69/0x150 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76 kernel: RIP: 0033:0x7ffbc0cb87c9 kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48 kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000 kernel: </TASK> kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E) kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E) kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1 kernel: CR2: 0000000000000028 kernel: ---[ end trace 0000000000000000 ]--- kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context") Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super When configuring a hugetlb filesystem via the fsconfig() syscall, there is a possible NULL dereference in hugetlbfs_fill_super() caused by assigning NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize is non valid. E.g: Taking the following steps: fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC); fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0); fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced with NULL, losing its previous value, and we will print an error: ... ... case Opt_pagesize: ps = memparse(param->string, &rest); ctx->hstate = h; if (!ctx->hstate) { pr_err("Unsupported page size %lu MB\n", ps / SZ_1M); return -EINVAL; } return 0; ... ... This is a problem because later on, we will dereference ctxt->hstate in hugetlbfs_fill_super() ... ... sb->s_blocksize = huge_page_size(ctx->hstate); ... ... Causing below Oops. Fix this by replacing cxt->hstate value only when then pagesize is known to be valid. kernel: hugetlbfs: Unsupported page size 0 MB kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028 kernel: #PF: supervisor read access in kernel mode kernel: #PF: error_code(0x0000) - not-present page kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0 kernel: Oops: 0000 [#1] PREEMPT SMP PTI kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017 kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 kernel: Call Trace: kernel: <TASK> kernel: ? __die_body+0x1a/0x60 kernel: ? page_fault_oops+0x16f/0x4a0 kernel: ? search_bpf_extables+0x65/0x70 kernel: ? fixup_exception+0x22/0x310 kernel: ? exc_page_fault+0x69/0x150 kernel: ? asm_exc_page_fault+0x22/0x30 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: ? hugetlbfs_fill_super+0xb4/0x1a0 kernel: ? hugetlbfs_fill_super+0x28/0x1a0 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: vfs_get_super+0x40/0xa0 kernel: ? __pfx_bpf_lsm_capable+0x10/0x10 kernel: vfs_get_tree+0x25/0xd0 kernel: vfs_cmd_create+0x64/0xe0 kernel: __x64_sys_fsconfig+0x395/0x410 kernel: do_syscall_64+0x80/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? exc_page_fault+0x69/0x150 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76 kernel: RIP: 0033:0x7ffbc0cb87c9 kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48 kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000 kernel: </TASK> kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E) kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E) kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1 kernel: CR2: 0000000000000028 kernel: ---[ end trace 0000000000000000 ]--- kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context") Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super When configuring a hugetlb filesystem via the fsconfig() syscall, there is a possible NULL dereference in hugetlbfs_fill_super() caused by assigning NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize is non valid. E.g: Taking the following steps: fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC); fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0); fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced with NULL, losing its previous value, and we will print an error: ... ... case Opt_pagesize: ps = memparse(param->string, &rest); ctx->hstate = h; if (!ctx->hstate) { pr_err("Unsupported page size %lu MB\n", ps / SZ_1M); return -EINVAL; } return 0; ... ... This is a problem because later on, we will dereference ctxt->hstate in hugetlbfs_fill_super() ... ... sb->s_blocksize = huge_page_size(ctx->hstate); ... ... Causing below Oops. Fix this by replacing cxt->hstate value only when then pagesize is known to be valid. kernel: hugetlbfs: Unsupported page size 0 MB kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028 kernel: #PF: supervisor read access in kernel mode kernel: #PF: error_code(0x0000) - not-present page kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0 kernel: Oops: 0000 [#1] PREEMPT SMP PTI kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017 kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 kernel: Call Trace: kernel: <TASK> kernel: ? __die_body+0x1a/0x60 kernel: ? page_fault_oops+0x16f/0x4a0 kernel: ? search_bpf_extables+0x65/0x70 kernel: ? fixup_exception+0x22/0x310 kernel: ? exc_page_fault+0x69/0x150 kernel: ? asm_exc_page_fault+0x22/0x30 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: ? hugetlbfs_fill_super+0xb4/0x1a0 kernel: ? hugetlbfs_fill_super+0x28/0x1a0 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: vfs_get_super+0x40/0xa0 kernel: ? __pfx_bpf_lsm_capable+0x10/0x10 kernel: vfs_get_tree+0x25/0xd0 kernel: vfs_cmd_create+0x64/0xe0 kernel: __x64_sys_fsconfig+0x395/0x410 kernel: do_syscall_64+0x80/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? exc_page_fault+0x69/0x150 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76 kernel: RIP: 0033:0x7ffbc0cb87c9 kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48 kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000 kernel: </TASK> kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E) kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E) kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1 kernel: CR2: 0000000000000028 kernel: ---[ end trace 0000000000000000 ]--- kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context") Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff e656c7a9 Tue Jan 23 01:04:42 MST 2024 Prakash Sangappa <prakash.sangappa@oracle.com> mm: hugetlb pages should not be reserved by shmat() if SHM_NORESERVE For shared memory of type SHM_HUGETLB, hugetlb pages are reserved in shmget() call. If SHM_NORESERVE flags is specified then the hugetlb pages are not reserved. However when the shared memory is attached with the shmat() call the hugetlb pages are getting reserved incorrectly for SHM_HUGETLB shared memory created with SHM_NORESERVE which is a bug. ------------------------------- Following test shows the issue. $cat shmhtb.c int main() { int shmflags = 0660 | IPC_CREAT | SHM_HUGETLB | SHM_NORESERVE; int shmid; shmid = shmget(SKEY, SHMSZ, shmflags); if (shmid < 0) { printf("shmat: shmget() failed, %d\n", errno); return 1; } printf("After shmget()\n"); system("cat /proc/meminfo | grep -i hugepages_"); shmat(shmid, NULL, 0); printf("\nAfter shmat()\n"); system("cat /proc/meminfo | grep -i hugepages_"); shmctl(shmid, IPC_RMID, NULL); return 0; } #sysctl -w vm.nr_hugepages=20 #./shmhtb After shmget() HugePages_Total: 20 HugePages_Free: 20 HugePages_Rsvd: 0 HugePages_Surp: 0 After shmat() HugePages_Total: 20 HugePages_Free: 20 HugePages_Rsvd: 5 <-- HugePages_Surp: 0 -------------------------------- Fix is to ensure that hugetlb pages are not reserved for SHM_HUGETLB shared memory in the shmat() call. Link: https://lkml.kernel.org/r/1706040282-12388-1-git-send-email-prakash.sangappa@oracle.com Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 19d3e221 Fri Jan 12 11:08:40 MST 2024 Sidhartha Kumar <sidhartha.kumar@oracle.com> fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling has_extra_refcount() makes the assumption that the page cache adds a ref count of 1 and subtracts this in the extra_pins case. Commit a08c7193e4f1 (mm/filemap: remove hugetlb special casing in filemap.c) modifies __filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases (including hugtetlb) where nr is the number of pages in the folio. We should adjust the number of references coming from the page cache by subtracing the number of pages rather than 1. In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong flag as, in the hugetlb case, memory-failure code calls folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is the correct function to test for that flag. After these fixes, the hugetlb hwpoison read selftest passes all cases. Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519 Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> [6.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 19d3e221 Fri Jan 12 11:08:40 MST 2024 Sidhartha Kumar <sidhartha.kumar@oracle.com> fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling has_extra_refcount() makes the assumption that the page cache adds a ref count of 1 and subtracts this in the extra_pins case. Commit a08c7193e4f1 (mm/filemap: remove hugetlb special casing in filemap.c) modifies __filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases (including hugtetlb) where nr is the number of pages in the folio. We should adjust the number of references coming from the page cache by subtracing the number of pages rather than 1. In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong flag as, in the hugetlb case, memory-failure code calls folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is the correct function to test for that flag. After these fixes, the hugetlb hwpoison read selftest passes all cases. Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519 Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> [6.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 19d3e221 Fri Jan 12 11:08:40 MST 2024 Sidhartha Kumar <sidhartha.kumar@oracle.com> fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling has_extra_refcount() makes the assumption that the page cache adds a ref count of 1 and subtracts this in the extra_pins case. Commit a08c7193e4f1 (mm/filemap: remove hugetlb special casing in filemap.c) modifies __filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases (including hugtetlb) where nr is the number of pages in the folio. We should adjust the number of references coming from the page cache by subtracing the number of pages rather than 1. In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong flag as, in the hugetlb case, memory-failure code calls folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is the correct function to test for that flag. After these fixes, the hugetlb hwpoison read selftest passes all cases. Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519 Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> [6.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 19d3e221 Fri Jan 12 11:08:40 MST 2024 Sidhartha Kumar <sidhartha.kumar@oracle.com> fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling has_extra_refcount() makes the assumption that the page cache adds a ref count of 1 and subtracts this in the extra_pins case. Commit a08c7193e4f1 (mm/filemap: remove hugetlb special casing in filemap.c) modifies __filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases (including hugtetlb) where nr is the number of pages in the folio. We should adjust the number of references coming from the page cache by subtracing the number of pages rather than 1. In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong flag as, in the hugetlb case, memory-failure code calls folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is the correct function to test for that flag. After these fixes, the hugetlb hwpoison read selftest passes all cases. Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519 Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> [6.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 10969b55 Tue Oct 03 03:15:09 MDT 2023 Hugh Dickins <hughd@google.com> hugetlbfs: drop shared NUMA mempolicy pretence Patch series "mempolicy: cleanups leading to NUMA mpol without vma", v2. Mostly cleanups in mm/mempolicy.c, but finally removing the pseudo-vma from shmem folio allocation, and removing the mmap_lock around folio migration for mbind and migrate_pages syscalls. This patch (of 12): hugetlbfs_fallocate() goes through the motions of pasting a shared NUMA mempolicy onto its pseudo-vma, but how could there ever be a shared NUMA mempolicy for this file? hugetlb_vm_ops has never offered a set_policy method, and hugetlbfs_parse_param() has never supported any mpol options for a mount-wide default policy. It's just an illusion: clean it away so as not to confuse others, giving us more freedom to adjust shmem's set_policy/get_policy implementation. But hugetlbfs_inode_info is still required, just to accommodate seals. Yes, shared NUMA mempolicy support could be added to hugetlbfs, with a set_policy method and/or mpol mount option (Andi's first posting did include an admitted-unsatisfactory hugetlb_set_policy()); but it seems that nobody has bothered to add that in the nineteen years since v2.6.7 made it possible, and there is at least one company that has invested enough into hugetlbfs, that I guess they have learnt well enough how to manage its NUMA, without needing shared mempolicy. Remove linux/mempolicy.h from linux/hugetlb.h: include linux/pagemap.h in its place, because hugetlb.h's recently added use of filemap_lock_folio() requires that (although most .configs and .c's get it in some other way). Link: https://lkml.kernel.org/r/ebc0987e-beff-8bfb-9283-234c2cbd17c5@google.com Link: https://lkml.kernel.org/r/cae82d4b-904a-faaf-282a-34fcc188c81f@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 10969b55 Tue Oct 03 03:15:09 MDT 2023 Hugh Dickins <hughd@google.com> hugetlbfs: drop shared NUMA mempolicy pretence Patch series "mempolicy: cleanups leading to NUMA mpol without vma", v2. Mostly cleanups in mm/mempolicy.c, but finally removing the pseudo-vma from shmem folio allocation, and removing the mmap_lock around folio migration for mbind and migrate_pages syscalls. This patch (of 12): hugetlbfs_fallocate() goes through the motions of pasting a shared NUMA mempolicy onto its pseudo-vma, but how could there ever be a shared NUMA mempolicy for this file? hugetlb_vm_ops has never offered a set_policy method, and hugetlbfs_parse_param() has never supported any mpol options for a mount-wide default policy. It's just an illusion: clean it away so as not to confuse others, giving us more freedom to adjust shmem's set_policy/get_policy implementation. But hugetlbfs_inode_info is still required, just to accommodate seals. Yes, shared NUMA mempolicy support could be added to hugetlbfs, with a set_policy method and/or mpol mount option (Andi's first posting did include an admitted-unsatisfactory hugetlb_set_policy()); but it seems that nobody has bothered to add that in the nineteen years since v2.6.7 made it possible, and there is at least one company that has invested enough into hugetlbfs, that I guess they have learnt well enough how to manage its NUMA, without needing shared mempolicy. Remove linux/mempolicy.h from linux/hugetlb.h: include linux/pagemap.h in its place, because hugetlb.h's recently added use of filemap_lock_folio() requires that (although most .configs and .c's get it in some other way). Link: https://lkml.kernel.org/r/ebc0987e-beff-8bfb-9283-234c2cbd17c5@google.com Link: https://lkml.kernel.org/r/cae82d4b-904a-faaf-282a-34fcc188c81f@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
/linux-master/fs/ext4/ | ||
H A D | inode.c | diff 2cd8bdb5 Mon Sep 18 04:45:50 MDT 2023 Ojaswin Mujoo <ojaswin@linux.ibm.com> ext4: mark buffer new if it is unwritten to avoid stale data exposure ** Short Version ** In ext4 with dioread_nolock, we could have a scenario where the bh returned by get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we never zero out the range of bh that is not under write, causing whatever stale data is present in the folio at that time to be written out to disk. To fix this mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten(). ** Long Version ** The issue mentioned above was resulting in two different bugs: 1. On block size < page size case in ext4, generic/269 was reliably failing with dioread_nolock. The state of the write was as follows: * The write was extending i_size. * The last block of the file was fallocated and had an unwritten extent * We were near ENOSPC and hence we were switching to non-delayed alloc allocation. In this case, the back trace that triggers the bug is as follows: ext4_da_write_begin() /* switch to nodelalloc due to low space */ ext4_write_begin() ext4_should_dioread_nolock() // true since mount flags still have delalloc __block_write_begin(..., ext4_get_block_unwritten) __block_write_begin_int() for(each buffer head in page) { /* first iteration, this is bh1 which contains i_size */ if (!buffer_mapped) get_block() /* returns bh with only UNWRITTEN and MAPPED */ /* second iteration, bh2 */ if (!buffer_mapped) get_block() /* we fail here, could be ENOSPC */ } if (err) /* * this would zero out all new buffers and mark them uptodate. * Since bh1 was never marked new, we skip it here which causes * the bug later. */ folio_zero_new_buffers(); /* ext4_wrte_begin() error handling */ ext4_truncate_failed_write() ext4_truncate() ext4_block_truncate_page() __ext4_block_zero_page_range() if(!buffer_uptodate()) ext4_read_bh_lock() ext4_read_bh() -> ... ext4_submit_bh_wbc() BUG_ON(buffer_unwritten(bh)); /* !!! */ 2. The second issue is stale data exposure with page size >= blocksize with dioread_nolock. The conditions needed for it to happen are same as the previous issue ie dioread_nolock around ENOSPC condition. The issue is also similar where in __block_write_begin_int() when we call ext4_get_block_unwritten() on the buffer_head and the underlying extent is unwritten, we get an unwritten and mapped buffer head. Since it is not new, we never zero out the partial range which is not under write, thus writing stale data to disk. This can be easily observed with the following reproducer: fallocate -l 4k testfile xfs_io -c "pwrite 2k 2k" testfile # hexdump output will have stale data in from byte 0 to 2k in testfile hexdump -C testfile NOTE: To trigger this, we need dioread_nolock enabled and write happening via ext4_write_begin(), which is usually used when we have -o nodealloc. Since dioread_nolock is disabled with nodelalloc, the only alternate way to call ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when ext4 is almost full like the way generic/269 was triggering it in Issue 1 above. This might make the issue harder to hit. Hence, for reliable replication, I used the below patch to temporarily allow dioread_nolock with nodelalloc and then mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale data issue 100% of times: @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode) if (ext4_should_journal_data(inode)) return 0; /* temporary fix to prevent generic/422 test failures */ - if (!test_opt(inode->i_sb, DELALLOC)) - return 0; + // if (!test_opt(inode->i_sb, DELALLOC)) + // return 0; return 1; } After applying this patch to mark buffer as NEW, both the above issues are fixed. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Cc: stable@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> diff 2cd8bdb5 Mon Sep 18 04:45:50 MDT 2023 Ojaswin Mujoo <ojaswin@linux.ibm.com> ext4: mark buffer new if it is unwritten to avoid stale data exposure ** Short Version ** In ext4 with dioread_nolock, we could have a scenario where the bh returned by get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we never zero out the range of bh that is not under write, causing whatever stale data is present in the folio at that time to be written out to disk. To fix this mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten(). ** Long Version ** The issue mentioned above was resulting in two different bugs: 1. On block size < page size case in ext4, generic/269 was reliably failing with dioread_nolock. The state of the write was as follows: * The write was extending i_size. * The last block of the file was fallocated and had an unwritten extent * We were near ENOSPC and hence we were switching to non-delayed alloc allocation. In this case, the back trace that triggers the bug is as follows: ext4_da_write_begin() /* switch to nodelalloc due to low space */ ext4_write_begin() ext4_should_dioread_nolock() // true since mount flags still have delalloc __block_write_begin(..., ext4_get_block_unwritten) __block_write_begin_int() for(each buffer head in page) { /* first iteration, this is bh1 which contains i_size */ if (!buffer_mapped) get_block() /* returns bh with only UNWRITTEN and MAPPED */ /* second iteration, bh2 */ if (!buffer_mapped) get_block() /* we fail here, could be ENOSPC */ } if (err) /* * this would zero out all new buffers and mark them uptodate. * Since bh1 was never marked new, we skip it here which causes * the bug later. */ folio_zero_new_buffers(); /* ext4_wrte_begin() error handling */ ext4_truncate_failed_write() ext4_truncate() ext4_block_truncate_page() __ext4_block_zero_page_range() if(!buffer_uptodate()) ext4_read_bh_lock() ext4_read_bh() -> ... ext4_submit_bh_wbc() BUG_ON(buffer_unwritten(bh)); /* !!! */ 2. The second issue is stale data exposure with page size >= blocksize with dioread_nolock. The conditions needed for it to happen are same as the previous issue ie dioread_nolock around ENOSPC condition. The issue is also similar where in __block_write_begin_int() when we call ext4_get_block_unwritten() on the buffer_head and the underlying extent is unwritten, we get an unwritten and mapped buffer head. Since it is not new, we never zero out the partial range which is not under write, thus writing stale data to disk. This can be easily observed with the following reproducer: fallocate -l 4k testfile xfs_io -c "pwrite 2k 2k" testfile # hexdump output will have stale data in from byte 0 to 2k in testfile hexdump -C testfile NOTE: To trigger this, we need dioread_nolock enabled and write happening via ext4_write_begin(), which is usually used when we have -o nodealloc. Since dioread_nolock is disabled with nodelalloc, the only alternate way to call ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when ext4 is almost full like the way generic/269 was triggering it in Issue 1 above. This might make the issue harder to hit. Hence, for reliable replication, I used the below patch to temporarily allow dioread_nolock with nodelalloc and then mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale data issue 100% of times: @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode) if (ext4_should_journal_data(inode)) return 0; /* temporary fix to prevent generic/422 test failures */ - if (!test_opt(inode->i_sb, DELALLOC)) - return 0; + // if (!test_opt(inode->i_sb, DELALLOC)) + // return 0; return 1; } After applying this patch to mark buffer as NEW, both the above issues are fixed. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Cc: stable@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> diff 8216776c Mon Aug 14 12:29:01 MDT 2023 Eric Biggers <ebiggers@google.com> ext4: reject casefold inode flag without casefold feature It is invalid for the casefold inode flag to be set without the casefold superblock feature flag also being set. e2fsck already considers this case to be invalid and handles it by offering to clear the casefold flag on the inode. __ext4_iget() also already considered this to be invalid, sort of, but it only got so far as logging an error message; it didn't actually reject the inode. Make it reject the inode so that other code doesn't have to handle this case. This matches what f2fs does. Note: we could check 's_encoding != NULL' instead of ext4_has_feature_casefold(). This would make the check robust against the casefold feature being enabled by userspace writing to the page cache of the mounted block device. However, it's unsolvable in general for filesystems to be robust against concurrent writes to the page cache of the mounted block device. Though this very particular scenario involving the casefold feature is solvable, we should not pretend that we can support this model, so let's just check the casefold feature. tune2fs already forbids enabling casefold on a mounted filesystem. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230814182903.37267-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> diff 1dedde69 Mon Mar 20 19:37:21 MDT 2023 Zhihao Cheng <chengzhihao1@huawei.com> ext4: fix i_disksize exceeding i_size problem in paritally written case It is possible for i_disksize can exceed i_size, triggering a warning. generic_perform_write copied = iov_iter_copy_from_user_atomic(len) // copied < len ext4_da_write_end | ext4_update_i_disksize | new_i_size = pos + copied; | WRITE_ONCE(EXT4_I(inode)->i_disksize, newsize) // update i_disksize | generic_write_end | copied = block_write_end(copied, len) // copied = 0 | if (unlikely(copied < len)) | if (!PageUptodate(page)) | copied = 0; | if (pos + copied > inode->i_size) // return false if (unlikely(copied == 0)) goto again; if (unlikely(iov_iter_fault_in_readable(i, bytes))) { status = -EFAULT; break; } We get i_disksize greater than i_size here, which could trigger WARNING check 'i_size_read(inode) < EXT4_I(inode)->i_disksize' while doing dio: ext4_dio_write_iter iomap_dio_rw __iomap_dio_rw // return err, length is not aligned to 512 ext4_handle_inode_extension WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize) // Oops WARNING: CPU: 2 PID: 2609 at fs/ext4/file.c:319 CPU: 2 PID: 2609 Comm: aa Not tainted 6.3.0-rc2 RIP: 0010:ext4_file_write_iter+0xbc7 Call Trace: vfs_write+0x3b1 ksys_write+0x77 do_syscall_64+0x39 Fix it by updating 'copied' value before updating i_disksize just like ext4_write_inline_data_end() does. A reproducer can be found in the buganizer link below. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217209 Fixes: 64769240bd07 ("ext4: Add delayed allocation support in data=writeback mode") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230321013721.89818-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org diff 5c099c4f Tue Dec 06 07:41:34 MST 2022 Ye Bin <yebin10@huawei.com> ext4: fix kernel BUG in 'ext4_write_inline_data_end()' Syzbot report follow issue: ------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:227! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3629 Comm: syz-executor212 Not tainted 6.1.0-rc5-syzkaller-00018-g59d0d52c30d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 RIP: 0010:ext4_write_inline_data+0x344/0x3e0 fs/ext4/inline.c:227 RSP: 0018:ffffc90003b3f368 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880704e16c0 RCX: 0000000000000000 RDX: ffff888021763a80 RSI: ffffffff821e31a4 RDI: 0000000000000006 RBP: 000000000006818e R08: 0000000000000006 R09: 0000000000068199 R10: 0000000000000079 R11: 0000000000000000 R12: 000000000000000b R13: 0000000000068199 R14: ffffc90003b3f408 R15: ffff8880704e1c82 FS: 000055555723e3c0(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffe8ac9080 CR3: 0000000079f81000 CR4: 0000000000350ee0 Call Trace: <TASK> ext4_write_inline_data_end+0x2a3/0x12f0 fs/ext4/inline.c:768 ext4_write_end+0x242/0xdd0 fs/ext4/inode.c:1313 ext4_da_write_end+0x3ed/0xa30 fs/ext4/inode.c:3063 generic_perform_write+0x316/0x570 mm/filemap.c:3764 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:285 ext4_file_write_iter+0x8bc/0x16e0 fs/ext4/file.c:700 call_write_iter include/linux/fs.h:2191 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1d0/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd ---[ end trace 0000000000000000 ]--- Above issue may happens as follows: ext4_da_write_begin ext4_da_write_inline_data_begin ext4_da_convert_inline_data_to_extent ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_da_write_end ext4_run_li_request ext4_mb_prefetch ext4_read_block_bitmap_nowait ext4_validate_block_bitmap ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT) percpu_counter_sub(&sbi->s_freeclusters_counter,grp->bb_free); -> sbi->s_freeclusters_counter become zero ext4_da_write_begin if (ext4_nonda_switch(inode->i_sb)) -> As freeclusters_counter is zero will return true *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; ext4_write_begin ext4_da_write_end if (write_mode == FALL_BACK_TO_NONDELALLOC) ext4_write_end if (inline_data) ext4_write_inline_data_end ext4_write_inline_data BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); -> As inode is already convert to extent, so 'pos + len' > inline_size -> then trigger BUG. To solve this issue, instead of checking ext4_has_inline_data() which is only cleared after data has been written back, check the EXT4_STATE_MAY_INLINE_DATA flag in ext4_write_end(). Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Reported-by: syzbot+4faa160fa96bfba639f8@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20221206144134.1919987-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org |
/linux-master/include/linux/ | ||
H A D | fs.h | diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap. In the struct address_space, there is a 32-byte gap between i_mmap and i_mmap_rwsem. Due to the alignment of struct address_space variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem may end up in the same CACHE line. While running Unixbench/execl, we observe high false sharing issues when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem after i_private_list, ensuring a 64-byte gap between i_mmap and i_mmap_rwsem. For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell improves by ~3.26%. Baseline: ------------------------------------------------------------- 162 546 748 11374 21 0xffff92e266af90c0 ------------------------------------------------------------- 46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1 with this change: ------------------------------------------------------------- 360 12 300 57 35 0xffff982cdae76400 ------------------------------------------------------------- 50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0 Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com> Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com Signed-off-by: Christian Brauner <brauner@kernel.org> diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap. In the struct address_space, there is a 32-byte gap between i_mmap and i_mmap_rwsem. Due to the alignment of struct address_space variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem may end up in the same CACHE line. While running Unixbench/execl, we observe high false sharing issues when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem after i_private_list, ensuring a 64-byte gap between i_mmap and i_mmap_rwsem. For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell improves by ~3.26%. Baseline: ------------------------------------------------------------- 162 546 748 11374 21 0xffff92e266af90c0 ------------------------------------------------------------- 46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1 with this change: ------------------------------------------------------------- 360 12 300 57 35 0xffff982cdae76400 ------------------------------------------------------------- 50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0 Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com> Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com Signed-off-by: Christian Brauner <brauner@kernel.org> diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap. In the struct address_space, there is a 32-byte gap between i_mmap and i_mmap_rwsem. Due to the alignment of struct address_space variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem may end up in the same CACHE line. While running Unixbench/execl, we observe high false sharing issues when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem after i_private_list, ensuring a 64-byte gap between i_mmap and i_mmap_rwsem. For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell improves by ~3.26%. Baseline: ------------------------------------------------------------- 162 546 748 11374 21 0xffff92e266af90c0 ------------------------------------------------------------- 46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1 with this change: ------------------------------------------------------------- 360 12 300 57 35 0xffff982cdae76400 ------------------------------------------------------------- 50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0 Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com> Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com Signed-off-by: Christian Brauner <brauner@kernel.org> diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap. In the struct address_space, there is a 32-byte gap between i_mmap and i_mmap_rwsem. Due to the alignment of struct address_space variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem may end up in the same CACHE line. While running Unixbench/execl, we observe high false sharing issues when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem after i_private_list, ensuring a 64-byte gap between i_mmap and i_mmap_rwsem. For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell improves by ~3.26%. Baseline: ------------------------------------------------------------- 162 546 748 11374 21 0xffff92e266af90c0 ------------------------------------------------------------- 46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1 with this change: ------------------------------------------------------------- 360 12 300 57 35 0xffff982cdae76400 ------------------------------------------------------------- 50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0 Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com> Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com Signed-off-by: Christian Brauner <brauner@kernel.org> diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap. In the struct address_space, there is a 32-byte gap between i_mmap and i_mmap_rwsem. Due to the alignment of struct address_space variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem may end up in the same CACHE line. While running Unixbench/execl, we observe high false sharing issues when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem after i_private_list, ensuring a 64-byte gap between i_mmap and i_mmap_rwsem. For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell improves by ~3.26%. Baseline: ------------------------------------------------------------- 162 546 748 11374 21 0xffff92e266af90c0 ------------------------------------------------------------- 46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1 with this change: ------------------------------------------------------------- 360 12 300 57 35 0xffff982cdae76400 ------------------------------------------------------------- 50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0 Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com> Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com Signed-off-by: Christian Brauner <brauner@kernel.org> diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap. In the struct address_space, there is a 32-byte gap between i_mmap and i_mmap_rwsem. Due to the alignment of struct address_space variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem may end up in the same CACHE line. While running Unixbench/execl, we observe high false sharing issues when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem after i_private_list, ensuring a 64-byte gap between i_mmap and i_mmap_rwsem. For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell improves by ~3.26%. Baseline: ------------------------------------------------------------- 162 546 748 11374 21 0xffff92e266af90c0 ------------------------------------------------------------- 46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1 with this change: ------------------------------------------------------------- 360 12 300 57 35 0xffff982cdae76400 ------------------------------------------------------------- 50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0 Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com> Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com Signed-off-by: Christian Brauner <brauner@kernel.org> diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap. In the struct address_space, there is a 32-byte gap between i_mmap and i_mmap_rwsem. Due to the alignment of struct address_space variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem may end up in the same CACHE line. While running Unixbench/execl, we observe high false sharing issues when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem after i_private_list, ensuring a 64-byte gap between i_mmap and i_mmap_rwsem. For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell improves by ~3.26%. Baseline: ------------------------------------------------------------- 162 546 748 11374 21 0xffff92e266af90c0 ------------------------------------------------------------- 46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1 30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1 1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1 with this change: ------------------------------------------------------------- 360 12 300 57 35 0xffff982cdae76400 ------------------------------------------------------------- 50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1 8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1 0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1 33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1 0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1 8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1 0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1 0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0 Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com> Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com Signed-off-by: Christian Brauner <brauner@kernel.org> diff b820de74 Thu Feb 15 01:47:38 MST 2024 Bart Van Assche <bvanassche@acm.org> fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the following kernel warning appears: WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8 Call trace: kiocb_set_cancel_fn+0x9c/0xa8 ffs_epfile_read_iter+0x144/0x1d0 io_read+0x19c/0x498 io_issue_sqe+0x118/0x27c io_submit_sqes+0x25c/0x5fc __arm64_sys_io_uring_enter+0x104/0xab0 invoke_syscall+0x58/0x11c el0_svc_common+0xb4/0xf4 do_el0_svc+0x2c/0xb0 el0_svc+0x2c/0xa4 el0t_64_sync_handler+0x68/0xb4 el0t_64_sync+0x1a4/0x1a8 Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is submitted by libaio. Suggested-by: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org> diff f91a704f Mon Oct 02 08:19:46 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: prepare for stackable filesystems backing file helpers In preparation for factoring out some backing file io helpers from overlayfs, move backing_file_open() into a new file fs/backing-file.c and header. Add a MAINTAINERS entry for stackable filesystems and add a Kconfig FS_STACK which stackable filesystems need to select. For now, the backing_file struct, the backing_file alloc/free functions and the backing_file_real_path() accessor remain internal to file_table.c. We may change that in the future. Signed-off-by: Amir Goldstein <amir73il@gmail.com> diff f91a704f Mon Oct 02 08:19:46 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: prepare for stackable filesystems backing file helpers In preparation for factoring out some backing file io helpers from overlayfs, move backing_file_open() into a new file fs/backing-file.c and header. Add a MAINTAINERS entry for stackable filesystems and add a Kconfig FS_STACK which stackable filesystems need to select. For now, the backing_file struct, the backing_file alloc/free functions and the backing_file_real_path() accessor remain internal to file_table.c. We may change that in the future. Signed-off-by: Amir Goldstein <amir73il@gmail.com> |
Completed in 2296 milliseconds