Searched +hist:4 +hist:b6f5d20 (Results 101 - 106 of 106) sorted by relevance

12345

/linux-master/net/
H A Dsocket.cdiff 1406245c Mon Oct 16 07:47:41 MDT 2023 Breno Leitao <leitao@debian.org> net/socket: Break down __sys_setsockopt

Split __sys_setsockopt() into two functions by removing the core
logic into a sub-function (do_sock_setsockopt()). This will avoid
code duplication when doing the same operation in other callers, for
instance.

do_sock_setsockopt() will be called by io_uring setsockopt() command
operation in the following patch.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-4-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff c889a99a Thu Sep 21 17:46:42 MDT 2023 Jordan Rife <jrife@google.com> net: prevent address rewrite in kernel_bind()

Similar to the change in commit 0bdf399342c5("net: Avoid address
overwrite in kernel_connect"), BPF hooks run on bind may rewrite the
address passed to kernel_bind(). This change

1) Makes a copy of the bind address in kernel_bind() to insulate
callers.
2) Replaces direct calls to sock->ops->bind() in net with kernel_bind()

Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/
Fixes: 4fbac77d2d09 ("bpf: Hooks for sys_bind")
Cc: stable@vger.kernel.org
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 2bfc6685 Wed Jun 07 12:19:10 MDT 2023 David Howells <dhowells@redhat.com> splice, net: Add a splice_eof op to file-ops and socket-ops

Add an optional method, ->splice_eof(), to allow splice to indicate the
premature termination of a splice to struct file_operations and struct
proto_ops.

This is called if sendfile() or splice() encounters all of the following
conditions inside splice_direct_to_actor():

(1) the user did not set SPLICE_F_MORE (splice only), and

(2) an EOF condition occurred (->splice_read() returned 0), and

(3) we haven't read enough to fulfill the request (ie. len > 0 still), and

(4) we have already spliced at least one byte.

A further patch will modify the behaviour of SPLICE_F_MORE to always be
passed to the actor if either the user set it or we haven't yet read
sufficient data to fulfill the request.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: David Hildenbrand <david@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: linux-mm@kvack.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 649c15c7 Tue Mar 07 10:37:07 MST 2023 Thadeu Lima de Souza Cascardo <cascardo@canonical.com> net: avoid double iput when sock_alloc_file fails

When sock_alloc_file fails to allocate a file, it will call sock_release.
__sys_socket_file should then not call sock_release again, otherwise there
will be a double free.

[ 89.319884] ------------[ cut here ]------------
[ 89.320286] kernel BUG at fs/inode.c:1764!
[ 89.320656] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 89.321051] CPU: 7 PID: 125 Comm: iou-sqp-124 Not tainted 6.2.0+ #361
[ 89.321535] RIP: 0010:iput+0x1ff/0x240
[ 89.321808] Code: d1 83 e1 03 48 83 f9 02 75 09 48 81 fa 00 10 00 00 77 05 83 e2 01 75 1f 4c 89 ef e8 fb d2 ba 00 e9 80 fe ff ff c3 cc cc cc cc <0f> 0b 0f 0b e9 d0 fe ff ff 0f 0b eb 8d 49 8d b4 24 08 01 00 00 48
[ 89.322760] RSP: 0018:ffffbdd60068bd50 EFLAGS: 00010202
[ 89.323036] RAX: 0000000000000000 RBX: ffff9d7ad3cacac0 RCX: 0000000000001107
[ 89.323412] RDX: 000000000003af00 RSI: 0000000000000000 RDI: ffff9d7ad3cacb40
[ 89.323785] RBP: ffffbdd60068bd68 R08: ffffffffffffffff R09: ffffffffab606438
[ 89.324157] R10: ffffffffacb3dfa0 R11: 6465686361657256 R12: ffff9d7ad3cacb40
[ 89.324529] R13: 0000000080000001 R14: 0000000080000001 R15: 0000000000000002
[ 89.324904] FS: 00007f7b28516740(0000) GS:ffff9d7aeb1c0000(0000) knlGS:0000000000000000
[ 89.325328] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 89.325629] CR2: 00007f0af52e96c0 CR3: 0000000002a02006 CR4: 0000000000770ee0
[ 89.326004] PKRU: 55555554
[ 89.326161] Call Trace:
[ 89.326298] <TASK>
[ 89.326419] __sock_release+0xb5/0xc0
[ 89.326632] __sys_socket_file+0xb2/0xd0
[ 89.326844] io_socket+0x88/0x100
[ 89.327039] ? io_issue_sqe+0x6a/0x430
[ 89.327258] io_issue_sqe+0x67/0x430
[ 89.327450] io_submit_sqes+0x1fe/0x670
[ 89.327661] io_sq_thread+0x2e6/0x530
[ 89.327859] ? __pfx_autoremove_wake_function+0x10/0x10
[ 89.328145] ? __pfx_io_sq_thread+0x10/0x10
[ 89.328367] ret_from_fork+0x29/0x50
[ 89.328576] RIP: 0033:0x0
[ 89.328732] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 89.329073] RSP: 002b:0000000000000000 EFLAGS: 00000202 ORIG_RAX: 00000000000001a9
[ 89.329477] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f7b28637a3d
[ 89.329845] RDX: 00007fff4e4318a8 RSI: 00007fff4e4318b0 RDI: 0000000000000400
[ 89.330216] RBP: 00007fff4e431830 R08: 00007fff4e431711 R09: 00007fff4e4318b0
[ 89.330584] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff4e441b38
[ 89.330950] R13: 0000563835e3e725 R14: 0000563835e40d10 R15: 00007f7b28784040
[ 89.331318] </TASK>
[ 89.331441] Modules linked in:
[ 89.331617] ---[ end trace 0000000000000000 ]---

Fixes: da214a475f8b ("net: add __sys_socket_file()")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230307173707.468744-1-cascardo@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 2558b803 Mon Feb 13 09:00:59 MST 2023 Eric Dumazet <edumazet@google.com> net: use a bounce buffer for copying skb->mark

syzbot found arm64 builds would crash in sock_recv_mark()
when CONFIG_HARDENED_USERCOPY=y

x86 and powerpc are not detecting the issue because
they define user_access_begin.
This will be handled in a different patch,
because a check_object_size() is missing.

Only data from skb->cb[] can be copied directly to/from user space,
as explained in commit 79a8a642bf05 ("net: Whitelist
the skbuff_head_cache "cb" field")

syzbot report was:
usercopy: Kernel memory exposure attempt detected from SLUB object 'skbuff_head_cache' (offset 168, size 4)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102 !
Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 4410 Comm: syz-executor533 Not tainted 6.2.0-rc7-syzkaller-17907-g2d3827b3f393 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/21/2023
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : usercopy_abort+0x90/0x94 mm/usercopy.c:90
lr : usercopy_abort+0x90/0x94 mm/usercopy.c:90
sp : ffff80000fb9b9a0
x29: ffff80000fb9b9b0 x28: ffff0000c6073400 x27: 0000000020001a00
x26: 0000000000000014 x25: ffff80000cf52000 x24: fffffc0000000000
x23: 05ffc00000000200 x22: fffffc000324bf80 x21: ffff0000c92fe1a8
x20: 0000000000000001 x19: 0000000000000004 x18: 0000000000000000
x17: 656a626f2042554c x16: ffff0000c6073dd0 x15: ffff80000dbd2118
x14: ffff0000c6073400 x13: 00000000ffffffff x12: ffff0000c6073400
x11: ff808000081bbb4c x10: 0000000000000000 x9 : 7b0572d7cc0ccf00
x8 : 7b0572d7cc0ccf00 x7 : ffff80000bf650d4 x6 : 0000000000000000
x5 : 0000000000000001 x4 : 0000000000000001 x3 : 0000000000000000
x2 : ffff0001fefbff08 x1 : 0000000100000000 x0 : 000000000000006c
Call trace:
usercopy_abort+0x90/0x94 mm/usercopy.c:90
__check_heap_object+0xa8/0x100 mm/slub.c:4761
check_heap_object mm/usercopy.c:196 [inline]
__check_object_size+0x208/0x6b8 mm/usercopy.c:251
check_object_size include/linux/thread_info.h:199 [inline]
__copy_to_user include/linux/uaccess.h:115 [inline]
put_cmsg+0x408/0x464 net/core/scm.c:238
sock_recv_mark net/socket.c:975 [inline]
__sock_recv_cmsgs+0x1fc/0x248 net/socket.c:984
sock_recv_cmsgs include/net/sock.h:2728 [inline]
packet_recvmsg+0x2d8/0x678 net/packet/af_packet.c:3482
____sys_recvmsg+0x110/0x3a0
___sys_recvmsg net/socket.c:2737 [inline]
__sys_recvmsg+0x194/0x210 net/socket.c:2767
__do_sys_recvmsg net/socket.c:2777 [inline]
__se_sys_recvmsg net/socket.c:2774 [inline]
__arm64_sys_recvmsg+0x2c/0x3c net/socket.c:2774
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall+0x64/0x178 arch/arm64/kernel/syscall.c:52
el0_svc_common+0xbc/0x180 arch/arm64/kernel/syscall.c:142
do_el0_svc+0x48/0x110 arch/arm64/kernel/syscall.c:193
el0_svc+0x58/0x14c arch/arm64/kernel/entry-common.c:637
el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:655
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591
Code: 91388800 aa0903e1 f90003e8 94e6d752 (d4210000)

Fixes: 6fd1d51cfa25 ("net: SO_RCVMARK socket option for SO_MARK with recvmsg()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Erin MacNeil <lnx.erin@gmail.com>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230213160059.3829741-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 4a367299 Fri Jul 17 00:23:11 MDT 2020 Christoph Hellwig <hch@lst.de> net: streamline __sys_setsockopt

Return early when sockfd_lookup_light fails to reduce a level of
indentation for most of the function body.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff a648a592 Wed Jul 03 08:06:54 MDT 2019 Paolo Abeni <pabeni@redhat.com> net: adjust socket level ICW to cope with ipv6 variant of {recv, send}msg

After the previous patch we have ipv{6,4} variants for {recv,send}msg,
we should use the generic _INET ICW variant to call into the proper
build-in.

This also allows dropping the now unused and rather ugly _INET4 ICW macro

v1 -> v2:
- use ICW macro to declare inet6_{recv,send}msg
- fix a couple of checkpatch offender in the code context

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff c6c9fee3 Fri Jan 25 14:43:19 MST 2019 Johannes Berg <johannes.berg@intel.com> net: socket: fix SIOCGIFNAME in compat

As reported by Robert O'Callahan in
https://bugzilla.kernel.org/show_bug.cgi?id=202273
reverting the previous changes in this area broke
the SIOCGIFNAME ioctl in compat again (I'd previously
fixed it after his previous report of breakage in
https://bugzilla.kernel.org/show_bug.cgi?id=199469).

This is obviously because I fixed SIOCGIFNAME more or
less by accident.

Fix it explicitly now by making it pass through the
restored compat translation code.

Cc: stable@vger.kernel.org
Fixes: 4cf808e7ac32 ("kill dev_ifname32()")
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff bf2ae2e4 Sat Mar 10 03:57:50 MST 2018 Xin Long <lucien.xin@gmail.com> sock_diag: request _diag module only when the family or proto has been registered

Now when using 'ss' in iproute, kernel would try to load all _diag
modules, which also causes corresponding family and proto modules
to be loaded as well due to module dependencies.

Like after running 'ss', sctp, dccp, af_packet (if it works as a module)
would be loaded.

For example:

$ lsmod|grep sctp
$ ss
$ lsmod|grep sctp
sctp_diag 16384 0
sctp 323584 5 sctp_diag
inet_diag 24576 4 raw_diag,tcp_diag,sctp_diag,udp_diag
libcrc32c 16384 3 nf_conntrack,nf_nat,sctp

As these family and proto modules are loaded unintentionally, it
could cause some problems, like:

- Some debug tools use 'ss' to collect the socket info, which loads all
those diag and family and protocol modules. It's noisy for identifying
issues.

- Users usually expect to drop sctp init packet silently when they
have no sense of sctp protocol instead of sending abort back.

- It wastes resources (especially with multiple netns), and SCTP module
can't be unloaded once it's loaded.

...

In short, it's really inappropriate to have these family and proto
modules loaded unexpectedly when just doing debugging with inet_diag.

This patch is to introduce sock_load_diag_module() where it loads
the _diag module only when it's corresponding family or proto has
been already registered.

Note that we can't just load _diag module without the family or
proto loaded, as some symbols used in _diag module are from the
family or proto module.

v1->v2:
- move inet proto check to inet_diag to avoid a compiling err.
v2->v3:
- define sock_load_diag_module in sock.c and export one symbol
only.
- improve the changelog.

Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Phil Sutter <phil@nwl.cc>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 4cf808e7 Sun Oct 01 19:12:09 MDT 2017 Al Viro <viro@zeniv.linux.org.uk> kill dev_ifname32()

same story...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
/linux-master/fs/hugetlbfs/
H A Dinode.cdiff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super

When configuring a hugetlb filesystem via the fsconfig() syscall, there is
a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
is non valid.

E.g: Taking the following steps:

fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced
with NULL, losing its previous value, and we will print an error:

...
...
case Opt_pagesize:
ps = memparse(param->string, &rest);
ctx->hstate = h;
if (!ctx->hstate) {
pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
return -EINVAL;
}
return 0;
...
...

This is a problem because later on, we will dereference ctxt->hstate in
hugetlbfs_fill_super()

...
...
sb->s_blocksize = huge_page_size(ctx->hstate);
...
...

Causing below Oops.

Fix this by replacing cxt->hstate value only when then pagesize is known
to be valid.

kernel: hugetlbfs: Unsupported page size 0 MB
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
kernel: Oops: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
kernel: Call Trace:
kernel: <TASK>
kernel: ? __die_body+0x1a/0x60
kernel: ? page_fault_oops+0x16f/0x4a0
kernel: ? search_bpf_extables+0x65/0x70
kernel: ? fixup_exception+0x22/0x310
kernel: ? exc_page_fault+0x69/0x150
kernel: ? asm_exc_page_fault+0x22/0x30
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: ? hugetlbfs_fill_super+0xb4/0x1a0
kernel: ? hugetlbfs_fill_super+0x28/0x1a0
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: vfs_get_super+0x40/0xa0
kernel: ? __pfx_bpf_lsm_capable+0x10/0x10
kernel: vfs_get_tree+0x25/0xd0
kernel: vfs_cmd_create+0x64/0xe0
kernel: __x64_sys_fsconfig+0x395/0x410
kernel: do_syscall_64+0x80/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? exc_page_fault+0x69/0x150
kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
kernel: RIP: 0033:0x7ffbc0cb87c9
kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
kernel: </TASK>
kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
kernel: CR2: 0000000000000028
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super

When configuring a hugetlb filesystem via the fsconfig() syscall, there is
a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
is non valid.

E.g: Taking the following steps:

fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced
with NULL, losing its previous value, and we will print an error:

...
...
case Opt_pagesize:
ps = memparse(param->string, &rest);
ctx->hstate = h;
if (!ctx->hstate) {
pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
return -EINVAL;
}
return 0;
...
...

This is a problem because later on, we will dereference ctxt->hstate in
hugetlbfs_fill_super()

...
...
sb->s_blocksize = huge_page_size(ctx->hstate);
...
...

Causing below Oops.

Fix this by replacing cxt->hstate value only when then pagesize is known
to be valid.

kernel: hugetlbfs: Unsupported page size 0 MB
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
kernel: Oops: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
kernel: Call Trace:
kernel: <TASK>
kernel: ? __die_body+0x1a/0x60
kernel: ? page_fault_oops+0x16f/0x4a0
kernel: ? search_bpf_extables+0x65/0x70
kernel: ? fixup_exception+0x22/0x310
kernel: ? exc_page_fault+0x69/0x150
kernel: ? asm_exc_page_fault+0x22/0x30
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: ? hugetlbfs_fill_super+0xb4/0x1a0
kernel: ? hugetlbfs_fill_super+0x28/0x1a0
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: vfs_get_super+0x40/0xa0
kernel: ? __pfx_bpf_lsm_capable+0x10/0x10
kernel: vfs_get_tree+0x25/0xd0
kernel: vfs_cmd_create+0x64/0xe0
kernel: __x64_sys_fsconfig+0x395/0x410
kernel: do_syscall_64+0x80/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? exc_page_fault+0x69/0x150
kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
kernel: RIP: 0033:0x7ffbc0cb87c9
kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
kernel: </TASK>
kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
kernel: CR2: 0000000000000028
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super

When configuring a hugetlb filesystem via the fsconfig() syscall, there is
a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
is non valid.

E.g: Taking the following steps:

fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced
with NULL, losing its previous value, and we will print an error:

...
...
case Opt_pagesize:
ps = memparse(param->string, &rest);
ctx->hstate = h;
if (!ctx->hstate) {
pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
return -EINVAL;
}
return 0;
...
...

This is a problem because later on, we will dereference ctxt->hstate in
hugetlbfs_fill_super()

...
...
sb->s_blocksize = huge_page_size(ctx->hstate);
...
...

Causing below Oops.

Fix this by replacing cxt->hstate value only when then pagesize is known
to be valid.

kernel: hugetlbfs: Unsupported page size 0 MB
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
kernel: Oops: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
kernel: Call Trace:
kernel: <TASK>
kernel: ? __die_body+0x1a/0x60
kernel: ? page_fault_oops+0x16f/0x4a0
kernel: ? search_bpf_extables+0x65/0x70
kernel: ? fixup_exception+0x22/0x310
kernel: ? exc_page_fault+0x69/0x150
kernel: ? asm_exc_page_fault+0x22/0x30
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: ? hugetlbfs_fill_super+0xb4/0x1a0
kernel: ? hugetlbfs_fill_super+0x28/0x1a0
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: vfs_get_super+0x40/0xa0
kernel: ? __pfx_bpf_lsm_capable+0x10/0x10
kernel: vfs_get_tree+0x25/0xd0
kernel: vfs_cmd_create+0x64/0xe0
kernel: __x64_sys_fsconfig+0x395/0x410
kernel: do_syscall_64+0x80/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? exc_page_fault+0x69/0x150
kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
kernel: RIP: 0033:0x7ffbc0cb87c9
kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
kernel: </TASK>
kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
kernel: CR2: 0000000000000028
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super

When configuring a hugetlb filesystem via the fsconfig() syscall, there is
a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
is non valid.

E.g: Taking the following steps:

fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced
with NULL, losing its previous value, and we will print an error:

...
...
case Opt_pagesize:
ps = memparse(param->string, &rest);
ctx->hstate = h;
if (!ctx->hstate) {
pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
return -EINVAL;
}
return 0;
...
...

This is a problem because later on, we will dereference ctxt->hstate in
hugetlbfs_fill_super()

...
...
sb->s_blocksize = huge_page_size(ctx->hstate);
...
...

Causing below Oops.

Fix this by replacing cxt->hstate value only when then pagesize is known
to be valid.

kernel: hugetlbfs: Unsupported page size 0 MB
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
kernel: Oops: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
kernel: Call Trace:
kernel: <TASK>
kernel: ? __die_body+0x1a/0x60
kernel: ? page_fault_oops+0x16f/0x4a0
kernel: ? search_bpf_extables+0x65/0x70
kernel: ? fixup_exception+0x22/0x310
kernel: ? exc_page_fault+0x69/0x150
kernel: ? asm_exc_page_fault+0x22/0x30
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: ? hugetlbfs_fill_super+0xb4/0x1a0
kernel: ? hugetlbfs_fill_super+0x28/0x1a0
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: vfs_get_super+0x40/0xa0
kernel: ? __pfx_bpf_lsm_capable+0x10/0x10
kernel: vfs_get_tree+0x25/0xd0
kernel: vfs_cmd_create+0x64/0xe0
kernel: __x64_sys_fsconfig+0x395/0x410
kernel: do_syscall_64+0x80/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? exc_page_fault+0x69/0x150
kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
kernel: RIP: 0033:0x7ffbc0cb87c9
kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
kernel: </TASK>
kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
kernel: CR2: 0000000000000028
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super

When configuring a hugetlb filesystem via the fsconfig() syscall, there is
a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
is non valid.

E.g: Taking the following steps:

fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced
with NULL, losing its previous value, and we will print an error:

...
...
case Opt_pagesize:
ps = memparse(param->string, &rest);
ctx->hstate = h;
if (!ctx->hstate) {
pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
return -EINVAL;
}
return 0;
...
...

This is a problem because later on, we will dereference ctxt->hstate in
hugetlbfs_fill_super()

...
...
sb->s_blocksize = huge_page_size(ctx->hstate);
...
...

Causing below Oops.

Fix this by replacing cxt->hstate value only when then pagesize is known
to be valid.

kernel: hugetlbfs: Unsupported page size 0 MB
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
kernel: Oops: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
kernel: Call Trace:
kernel: <TASK>
kernel: ? __die_body+0x1a/0x60
kernel: ? page_fault_oops+0x16f/0x4a0
kernel: ? search_bpf_extables+0x65/0x70
kernel: ? fixup_exception+0x22/0x310
kernel: ? exc_page_fault+0x69/0x150
kernel: ? asm_exc_page_fault+0x22/0x30
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: ? hugetlbfs_fill_super+0xb4/0x1a0
kernel: ? hugetlbfs_fill_super+0x28/0x1a0
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: vfs_get_super+0x40/0xa0
kernel: ? __pfx_bpf_lsm_capable+0x10/0x10
kernel: vfs_get_tree+0x25/0xd0
kernel: vfs_cmd_create+0x64/0xe0
kernel: __x64_sys_fsconfig+0x395/0x410
kernel: do_syscall_64+0x80/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? exc_page_fault+0x69/0x150
kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
kernel: RIP: 0033:0x7ffbc0cb87c9
kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
kernel: </TASK>
kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
kernel: CR2: 0000000000000028
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super

When configuring a hugetlb filesystem via the fsconfig() syscall, there is
a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
is non valid.

E.g: Taking the following steps:

fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced
with NULL, losing its previous value, and we will print an error:

...
...
case Opt_pagesize:
ps = memparse(param->string, &rest);
ctx->hstate = h;
if (!ctx->hstate) {
pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
return -EINVAL;
}
return 0;
...
...

This is a problem because later on, we will dereference ctxt->hstate in
hugetlbfs_fill_super()

...
...
sb->s_blocksize = huge_page_size(ctx->hstate);
...
...

Causing below Oops.

Fix this by replacing cxt->hstate value only when then pagesize is known
to be valid.

kernel: hugetlbfs: Unsupported page size 0 MB
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
kernel: Oops: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
kernel: Call Trace:
kernel: <TASK>
kernel: ? __die_body+0x1a/0x60
kernel: ? page_fault_oops+0x16f/0x4a0
kernel: ? search_bpf_extables+0x65/0x70
kernel: ? fixup_exception+0x22/0x310
kernel: ? exc_page_fault+0x69/0x150
kernel: ? asm_exc_page_fault+0x22/0x30
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: ? hugetlbfs_fill_super+0xb4/0x1a0
kernel: ? hugetlbfs_fill_super+0x28/0x1a0
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: vfs_get_super+0x40/0xa0
kernel: ? __pfx_bpf_lsm_capable+0x10/0x10
kernel: vfs_get_tree+0x25/0xd0
kernel: vfs_cmd_create+0x64/0xe0
kernel: __x64_sys_fsconfig+0x395/0x410
kernel: do_syscall_64+0x80/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? exc_page_fault+0x69/0x150
kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
kernel: RIP: 0033:0x7ffbc0cb87c9
kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
kernel: </TASK>
kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
kernel: CR2: 0000000000000028
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 79d72c68 Tue Jan 30 14:04:18 MST 2024 Oscar Salvador <osalvador@suse.de> fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super

When configuring a hugetlb filesystem via the fsconfig() syscall, there is
a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
is non valid.

E.g: Taking the following steps:

fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced
with NULL, losing its previous value, and we will print an error:

...
...
case Opt_pagesize:
ps = memparse(param->string, &rest);
ctx->hstate = h;
if (!ctx->hstate) {
pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
return -EINVAL;
}
return 0;
...
...

This is a problem because later on, we will dereference ctxt->hstate in
hugetlbfs_fill_super()

...
...
sb->s_blocksize = huge_page_size(ctx->hstate);
...
...

Causing below Oops.

Fix this by replacing cxt->hstate value only when then pagesize is known
to be valid.

kernel: hugetlbfs: Unsupported page size 0 MB
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
kernel: Oops: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
kernel: Call Trace:
kernel: <TASK>
kernel: ? __die_body+0x1a/0x60
kernel: ? page_fault_oops+0x16f/0x4a0
kernel: ? search_bpf_extables+0x65/0x70
kernel: ? fixup_exception+0x22/0x310
kernel: ? exc_page_fault+0x69/0x150
kernel: ? asm_exc_page_fault+0x22/0x30
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: ? hugetlbfs_fill_super+0xb4/0x1a0
kernel: ? hugetlbfs_fill_super+0x28/0x1a0
kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
kernel: vfs_get_super+0x40/0xa0
kernel: ? __pfx_bpf_lsm_capable+0x10/0x10
kernel: vfs_get_tree+0x25/0xd0
kernel: vfs_cmd_create+0x64/0xe0
kernel: __x64_sys_fsconfig+0x395/0x410
kernel: do_syscall_64+0x80/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? syscall_exit_to_user_mode+0x82/0x240
kernel: ? do_syscall_64+0x8d/0x160
kernel: ? exc_page_fault+0x69/0x150
kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
kernel: RIP: 0033:0x7ffbc0cb87c9
kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
kernel: </TASK>
kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
kernel: CR2: 0000000000000028
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 38c1ddbd Wed Jul 12 18:18:32 MDT 2023 Jiaqi Yan <jiaqiyan@google.com> hugetlbfs: improve read HWPOISON hugepage

When a hugepage contains HWPOISON pages, read() fails to read any byte of
the hugepage and returns -EIO, although many bytes in the HWPOISON
hugepage are readable.

Improve this by allowing hugetlbfs_read_iter returns as many bytes as
possible. For a requested range [offset, offset + len) that contains
HWPOISON page, return [offset, first HWPOISON page addr); the next read
attempt will fail and return -EIO.

Link: https://lkml.kernel.org/r/20230713001833.3778937-4-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 243b1f2d Fri Dec 16 08:50:52 MST 2022 Peter Xu <peterx@redhat.com> mm/hugetlb: let vma_offset_start() to return start

Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
unshare", v4.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.

huge_pte_offset() is always called with mmap lock held with either read or
write. It was assumed to be safe but it's actually not. One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).

The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.

There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):

======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>

#define MSIZE (1UL << 30) /* 1GB */
#define PSIZE (2UL << 20) /* 2MB */

#define HOLD_SEC (1)

int pipefd[2];
void *buf;

void *do_map(int fd)
{
unsigned char *tmpbuf, *p;
int ret;

ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
if (ret) {
perror("posix_memalign() failed");
return NULL;
}

tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_FIXED, fd, 0);
if (tmpbuf == MAP_FAILED) {
perror("mmap() failed");
return NULL;
}
printf("mmap() -> %p\n", tmpbuf);

for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
*p = 1;
}

return tmpbuf;
}

void do_unmap(void *buf)
{
munmap(buf, MSIZE);
}

void proc2(int fd)
{
unsigned char c;

buf = do_map(fd);
if (!buf)
return;

read(pipefd[0], &c, 1);
/*
* This frees the shared pgtable page, causing use-after-free in
* proc1_thread1 when soft walking hugetlb pgtable.
*/
do_unmap(buf);

printf("Proc2 quitting\n");
}

void *proc1_thread1(void *data)
{
/*
* Trigger follow-page on 1st 2m page. Kernel hack patch needed to
* withhold this procedure for easier reproduce.
*/
madvise(buf, PSIZE, MADV_POPULATE_WRITE);
printf("Proc1-thread1 quitting\n");
return NULL;
}

void *proc1_thread2(void *data)
{
unsigned char c;

/* Wait a while until proc1_thread1() start to wait */
sleep(0.5);
/* Trigger pmd unshare */
madvise(buf, PSIZE, MADV_DONTNEED);
/* Kick off proc2 to release the pgtable */
write(pipefd[1], &c, 1);

printf("Proc1-thread2 quitting\n");
return NULL;
}

void proc1(int fd)
{
pthread_t tid1, tid2;
int ret;

buf = do_map(fd);
if (!buf)
return;

ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
assert(ret == 0);
ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
assert(ret == 0);

/* Kick the child to share the PUD entry */
pthread_join(tid1, NULL);
pthread_join(tid2, NULL);

do_unmap(buf);
}

int main(void)
{
int fd, ret;

fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
if (fd < 0) {
perror("open failed");
return -1;
}

ret = ftruncate(fd, MSIZE);
if (ret) {
perror("ftruncate() failed");
return -1;
}

ret = pipe(pipefd);
if (ret) {
perror("pipe() failed");
return -1;
}

if (fork()) {
proc1(fd);
} else {
proc2(fd);
}

close(pipefd[0]);
close(pipefd[1]);
close(fd);

return 0;
}
======8<=======

The kernel patch needed to present such a race so it'll trigger 100%:

======8<=======
: diff --git a/mm/hugetlb.c b/mm/hugetlb.c
: index 9d97c9a2a15d..f8d99dad5004 100644
: --- a/mm/hugetlb.c
: +++ b/mm/hugetlb.c
: @@ -38,6 +38,7 @@
: #include <asm/page.h>
: #include <asm/pgalloc.h>
: #include <asm/tlb.h>
: +#include <asm/delay.h>
:
: #include <linux/io.h>
: #include <linux/hugetlb.h>
: @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: bool unshare = false;
: int absent;
: struct page *page;
: + unsigned long c = 0;
:
: /*
: * If we have a pending SIGKILL, don't keep faulting pages and
: @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: */
: pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
: huge_page_size(h));
: +
: + pr_info("%s: withhold 1 sec...\n", __func__);
: + for (c = 0; c < 100; c++) {
: + udelay(10000);
: + }
: + pr_info("%s: withhold 1 sec...done\n", __func__);
: +
: if (pte)
: ptl = huge_pte_lock(h, mm, pte);
: absent = !pte || huge_pte_none(huge_ptep_get(pte));
: ======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[ 16.959907] follow_hugetlb_page: withhold 1 sec...
[ 17.960315] follow_hugetlb_page: withhold 1 sec...done
[ 17.960550] ------------[ cut here ]------------
[ 17.960742] DEBUG_LOCKS_WARN_ON(1)
[ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[ 17.961264] Modules linked in:
[ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.965956] PKRU: 55555554
[ 17.966068] Call Trace:
[ 17.966172] <TASK>
[ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.966455] lock_acquire+0xbf/0x2b0
[ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.966799] ? _printk+0x48/0x4e
[ 17.966934] _raw_spin_lock+0x2f/0x40
[ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967473] __get_user_pages+0xbb/0x620
[ 17.967635] faultin_vma_page_range+0x9a/0x100
[ 17.967817] madvise_vma_behavior+0x3c0/0xbd0
[ 17.967998] ? mas_prev+0x11/0x290
[ 17.968141] ? find_vma_prev+0x5e/0xa0
[ 17.968304] ? madvise_vma_anon_name+0x70/0x70
[ 17.968486] madvise_walk_vmas+0xa9/0x120
[ 17.968650] do_madvise.part.0+0xfa/0x270
[ 17.968813] __x64_sys_madvise+0x5a/0x70
[ 17.968974] do_syscall_64+0x37/0x90
[ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 17.969329] RIP: 0033:0x7f1840f0efdb
[ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[ 17.972083] </TASK>
[ 17.972199] irq event stamp: 2353
[ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.974341] ---[ end trace 0000000000000000 ]---
[ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[ 17.975012] #PF: supervisor read access in kernel mode
[ 17.975314] #PF: error_code(0x0000) - not-present page
[ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46
[ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.986674] PKRU: 55555554
[ 17.986832] Call Trace:
[ 17.987012] <TASK>
[ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.987770] lock_acquire+0xbf/0x2b0
[ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.988575] ? _printk+0x48/0x4e
[ 17.988889] _raw_spin_lock+0x2f/0x40
[ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.990119] __get_user_pages+0xbb/0x620
[ 17.990500] faultin_vma_page_range+0x9a/0x100
[ 17.990928] madvise_vma_behavior+0x3c0/0xbd0
[ 17.991354] ? mas_prev+0x11/0x290
[ 17.991678] ? find_vma_prev+0x5e/0xa0
[ 17.992024] ? madvise_vma_anon_name+0x70/0x70
[ 17.992421] madvise_walk_vmas+0xa9/0x120
[ 17.992793] do_madvise.part.0+0xfa/0x270
[ 17.993166] __x64_sys_madvise+0x5a/0x70
[ 17.993539] do_syscall_64+0x37/0x90
[ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.

Patch Layout
============

Patch 1-2: cleanup, or dependency of the follow up patches
Patch 3: before fixing, document huge_pte_offset() on lock required
Patch 4-8: each patch resolves one possible race condition
Patch 9: introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

The series is verified with the above reproducer so the race cannot
trigger anymore. It also passes all hugetlb kselftests.


This patch (of 9):

Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should use to offset
the vma->vm_start address.

Make it return the real value of the start vaddr, and it also helps for
all the callers because whenever the retval is used, it'll be ultimately
added into the vma->vm_start anyway, so it's better.

Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 243b1f2d Fri Dec 16 08:50:52 MST 2022 Peter Xu <peterx@redhat.com> mm/hugetlb: let vma_offset_start() to return start

Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
unshare", v4.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.

huge_pte_offset() is always called with mmap lock held with either read or
write. It was assumed to be safe but it's actually not. One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).

The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.

There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):

======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>

#define MSIZE (1UL << 30) /* 1GB */
#define PSIZE (2UL << 20) /* 2MB */

#define HOLD_SEC (1)

int pipefd[2];
void *buf;

void *do_map(int fd)
{
unsigned char *tmpbuf, *p;
int ret;

ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
if (ret) {
perror("posix_memalign() failed");
return NULL;
}

tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_FIXED, fd, 0);
if (tmpbuf == MAP_FAILED) {
perror("mmap() failed");
return NULL;
}
printf("mmap() -> %p\n", tmpbuf);

for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
*p = 1;
}

return tmpbuf;
}

void do_unmap(void *buf)
{
munmap(buf, MSIZE);
}

void proc2(int fd)
{
unsigned char c;

buf = do_map(fd);
if (!buf)
return;

read(pipefd[0], &c, 1);
/*
* This frees the shared pgtable page, causing use-after-free in
* proc1_thread1 when soft walking hugetlb pgtable.
*/
do_unmap(buf);

printf("Proc2 quitting\n");
}

void *proc1_thread1(void *data)
{
/*
* Trigger follow-page on 1st 2m page. Kernel hack patch needed to
* withhold this procedure for easier reproduce.
*/
madvise(buf, PSIZE, MADV_POPULATE_WRITE);
printf("Proc1-thread1 quitting\n");
return NULL;
}

void *proc1_thread2(void *data)
{
unsigned char c;

/* Wait a while until proc1_thread1() start to wait */
sleep(0.5);
/* Trigger pmd unshare */
madvise(buf, PSIZE, MADV_DONTNEED);
/* Kick off proc2 to release the pgtable */
write(pipefd[1], &c, 1);

printf("Proc1-thread2 quitting\n");
return NULL;
}

void proc1(int fd)
{
pthread_t tid1, tid2;
int ret;

buf = do_map(fd);
if (!buf)
return;

ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
assert(ret == 0);
ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
assert(ret == 0);

/* Kick the child to share the PUD entry */
pthread_join(tid1, NULL);
pthread_join(tid2, NULL);

do_unmap(buf);
}

int main(void)
{
int fd, ret;

fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
if (fd < 0) {
perror("open failed");
return -1;
}

ret = ftruncate(fd, MSIZE);
if (ret) {
perror("ftruncate() failed");
return -1;
}

ret = pipe(pipefd);
if (ret) {
perror("pipe() failed");
return -1;
}

if (fork()) {
proc1(fd);
} else {
proc2(fd);
}

close(pipefd[0]);
close(pipefd[1]);
close(fd);

return 0;
}
======8<=======

The kernel patch needed to present such a race so it'll trigger 100%:

======8<=======
: diff --git a/mm/hugetlb.c b/mm/hugetlb.c
: index 9d97c9a2a15d..f8d99dad5004 100644
: --- a/mm/hugetlb.c
: +++ b/mm/hugetlb.c
: @@ -38,6 +38,7 @@
: #include <asm/page.h>
: #include <asm/pgalloc.h>
: #include <asm/tlb.h>
: +#include <asm/delay.h>
:
: #include <linux/io.h>
: #include <linux/hugetlb.h>
: @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: bool unshare = false;
: int absent;
: struct page *page;
: + unsigned long c = 0;
:
: /*
: * If we have a pending SIGKILL, don't keep faulting pages and
: @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: */
: pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
: huge_page_size(h));
: +
: + pr_info("%s: withhold 1 sec...\n", __func__);
: + for (c = 0; c < 100; c++) {
: + udelay(10000);
: + }
: + pr_info("%s: withhold 1 sec...done\n", __func__);
: +
: if (pte)
: ptl = huge_pte_lock(h, mm, pte);
: absent = !pte || huge_pte_none(huge_ptep_get(pte));
: ======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[ 16.959907] follow_hugetlb_page: withhold 1 sec...
[ 17.960315] follow_hugetlb_page: withhold 1 sec...done
[ 17.960550] ------------[ cut here ]------------
[ 17.960742] DEBUG_LOCKS_WARN_ON(1)
[ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[ 17.961264] Modules linked in:
[ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.965956] PKRU: 55555554
[ 17.966068] Call Trace:
[ 17.966172] <TASK>
[ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.966455] lock_acquire+0xbf/0x2b0
[ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.966799] ? _printk+0x48/0x4e
[ 17.966934] _raw_spin_lock+0x2f/0x40
[ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967473] __get_user_pages+0xbb/0x620
[ 17.967635] faultin_vma_page_range+0x9a/0x100
[ 17.967817] madvise_vma_behavior+0x3c0/0xbd0
[ 17.967998] ? mas_prev+0x11/0x290
[ 17.968141] ? find_vma_prev+0x5e/0xa0
[ 17.968304] ? madvise_vma_anon_name+0x70/0x70
[ 17.968486] madvise_walk_vmas+0xa9/0x120
[ 17.968650] do_madvise.part.0+0xfa/0x270
[ 17.968813] __x64_sys_madvise+0x5a/0x70
[ 17.968974] do_syscall_64+0x37/0x90
[ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 17.969329] RIP: 0033:0x7f1840f0efdb
[ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[ 17.972083] </TASK>
[ 17.972199] irq event stamp: 2353
[ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.974341] ---[ end trace 0000000000000000 ]---
[ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[ 17.975012] #PF: supervisor read access in kernel mode
[ 17.975314] #PF: error_code(0x0000) - not-present page
[ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46
[ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.986674] PKRU: 55555554
[ 17.986832] Call Trace:
[ 17.987012] <TASK>
[ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.987770] lock_acquire+0xbf/0x2b0
[ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.988575] ? _printk+0x48/0x4e
[ 17.988889] _raw_spin_lock+0x2f/0x40
[ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.990119] __get_user_pages+0xbb/0x620
[ 17.990500] faultin_vma_page_range+0x9a/0x100
[ 17.990928] madvise_vma_behavior+0x3c0/0xbd0
[ 17.991354] ? mas_prev+0x11/0x290
[ 17.991678] ? find_vma_prev+0x5e/0xa0
[ 17.992024] ? madvise_vma_anon_name+0x70/0x70
[ 17.992421] madvise_walk_vmas+0xa9/0x120
[ 17.992793] do_madvise.part.0+0xfa/0x270
[ 17.993166] __x64_sys_madvise+0x5a/0x70
[ 17.993539] do_syscall_64+0x37/0x90
[ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.

Patch Layout
============

Patch 1-2: cleanup, or dependency of the follow up patches
Patch 3: before fixing, document huge_pte_offset() on lock required
Patch 4-8: each patch resolves one possible race condition
Patch 9: introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

The series is verified with the above reproducer so the race cannot
trigger anymore. It also passes all hugetlb kselftests.


This patch (of 9):

Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should use to offset
the vma->vm_start address.

Make it return the real value of the start vaddr, and it also helps for
all the callers because whenever the retval is used, it'll be ultimately
added into the vma->vm_start anyway, so it's better.

Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 243b1f2d Fri Dec 16 08:50:52 MST 2022 Peter Xu <peterx@redhat.com> mm/hugetlb: let vma_offset_start() to return start

Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
unshare", v4.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.

huge_pte_offset() is always called with mmap lock held with either read or
write. It was assumed to be safe but it's actually not. One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).

The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.

There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):

======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>

#define MSIZE (1UL << 30) /* 1GB */
#define PSIZE (2UL << 20) /* 2MB */

#define HOLD_SEC (1)

int pipefd[2];
void *buf;

void *do_map(int fd)
{
unsigned char *tmpbuf, *p;
int ret;

ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
if (ret) {
perror("posix_memalign() failed");
return NULL;
}

tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_FIXED, fd, 0);
if (tmpbuf == MAP_FAILED) {
perror("mmap() failed");
return NULL;
}
printf("mmap() -> %p\n", tmpbuf);

for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
*p = 1;
}

return tmpbuf;
}

void do_unmap(void *buf)
{
munmap(buf, MSIZE);
}

void proc2(int fd)
{
unsigned char c;

buf = do_map(fd);
if (!buf)
return;

read(pipefd[0], &c, 1);
/*
* This frees the shared pgtable page, causing use-after-free in
* proc1_thread1 when soft walking hugetlb pgtable.
*/
do_unmap(buf);

printf("Proc2 quitting\n");
}

void *proc1_thread1(void *data)
{
/*
* Trigger follow-page on 1st 2m page. Kernel hack patch needed to
* withhold this procedure for easier reproduce.
*/
madvise(buf, PSIZE, MADV_POPULATE_WRITE);
printf("Proc1-thread1 quitting\n");
return NULL;
}

void *proc1_thread2(void *data)
{
unsigned char c;

/* Wait a while until proc1_thread1() start to wait */
sleep(0.5);
/* Trigger pmd unshare */
madvise(buf, PSIZE, MADV_DONTNEED);
/* Kick off proc2 to release the pgtable */
write(pipefd[1], &c, 1);

printf("Proc1-thread2 quitting\n");
return NULL;
}

void proc1(int fd)
{
pthread_t tid1, tid2;
int ret;

buf = do_map(fd);
if (!buf)
return;

ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
assert(ret == 0);
ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
assert(ret == 0);

/* Kick the child to share the PUD entry */
pthread_join(tid1, NULL);
pthread_join(tid2, NULL);

do_unmap(buf);
}

int main(void)
{
int fd, ret;

fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
if (fd < 0) {
perror("open failed");
return -1;
}

ret = ftruncate(fd, MSIZE);
if (ret) {
perror("ftruncate() failed");
return -1;
}

ret = pipe(pipefd);
if (ret) {
perror("pipe() failed");
return -1;
}

if (fork()) {
proc1(fd);
} else {
proc2(fd);
}

close(pipefd[0]);
close(pipefd[1]);
close(fd);

return 0;
}
======8<=======

The kernel patch needed to present such a race so it'll trigger 100%:

======8<=======
: diff --git a/mm/hugetlb.c b/mm/hugetlb.c
: index 9d97c9a2a15d..f8d99dad5004 100644
: --- a/mm/hugetlb.c
: +++ b/mm/hugetlb.c
: @@ -38,6 +38,7 @@
: #include <asm/page.h>
: #include <asm/pgalloc.h>
: #include <asm/tlb.h>
: +#include <asm/delay.h>
:
: #include <linux/io.h>
: #include <linux/hugetlb.h>
: @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: bool unshare = false;
: int absent;
: struct page *page;
: + unsigned long c = 0;
:
: /*
: * If we have a pending SIGKILL, don't keep faulting pages and
: @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: */
: pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
: huge_page_size(h));
: +
: + pr_info("%s: withhold 1 sec...\n", __func__);
: + for (c = 0; c < 100; c++) {
: + udelay(10000);
: + }
: + pr_info("%s: withhold 1 sec...done\n", __func__);
: +
: if (pte)
: ptl = huge_pte_lock(h, mm, pte);
: absent = !pte || huge_pte_none(huge_ptep_get(pte));
: ======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[ 16.959907] follow_hugetlb_page: withhold 1 sec...
[ 17.960315] follow_hugetlb_page: withhold 1 sec...done
[ 17.960550] ------------[ cut here ]------------
[ 17.960742] DEBUG_LOCKS_WARN_ON(1)
[ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[ 17.961264] Modules linked in:
[ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.965956] PKRU: 55555554
[ 17.966068] Call Trace:
[ 17.966172] <TASK>
[ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.966455] lock_acquire+0xbf/0x2b0
[ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.966799] ? _printk+0x48/0x4e
[ 17.966934] _raw_spin_lock+0x2f/0x40
[ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967473] __get_user_pages+0xbb/0x620
[ 17.967635] faultin_vma_page_range+0x9a/0x100
[ 17.967817] madvise_vma_behavior+0x3c0/0xbd0
[ 17.967998] ? mas_prev+0x11/0x290
[ 17.968141] ? find_vma_prev+0x5e/0xa0
[ 17.968304] ? madvise_vma_anon_name+0x70/0x70
[ 17.968486] madvise_walk_vmas+0xa9/0x120
[ 17.968650] do_madvise.part.0+0xfa/0x270
[ 17.968813] __x64_sys_madvise+0x5a/0x70
[ 17.968974] do_syscall_64+0x37/0x90
[ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 17.969329] RIP: 0033:0x7f1840f0efdb
[ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[ 17.972083] </TASK>
[ 17.972199] irq event stamp: 2353
[ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.974341] ---[ end trace 0000000000000000 ]---
[ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[ 17.975012] #PF: supervisor read access in kernel mode
[ 17.975314] #PF: error_code(0x0000) - not-present page
[ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46
[ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.986674] PKRU: 55555554
[ 17.986832] Call Trace:
[ 17.987012] <TASK>
[ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.987770] lock_acquire+0xbf/0x2b0
[ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.988575] ? _printk+0x48/0x4e
[ 17.988889] _raw_spin_lock+0x2f/0x40
[ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.990119] __get_user_pages+0xbb/0x620
[ 17.990500] faultin_vma_page_range+0x9a/0x100
[ 17.990928] madvise_vma_behavior+0x3c0/0xbd0
[ 17.991354] ? mas_prev+0x11/0x290
[ 17.991678] ? find_vma_prev+0x5e/0xa0
[ 17.992024] ? madvise_vma_anon_name+0x70/0x70
[ 17.992421] madvise_walk_vmas+0xa9/0x120
[ 17.992793] do_madvise.part.0+0xfa/0x270
[ 17.993166] __x64_sys_madvise+0x5a/0x70
[ 17.993539] do_syscall_64+0x37/0x90
[ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.

Patch Layout
============

Patch 1-2: cleanup, or dependency of the follow up patches
Patch 3: before fixing, document huge_pte_offset() on lock required
Patch 4-8: each patch resolves one possible race condition
Patch 9: introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

The series is verified with the above reproducer so the race cannot
trigger anymore. It also passes all hugetlb kselftests.


This patch (of 9):

Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should use to offset
the vma->vm_start address.

Make it return the real value of the start vaddr, and it also helps for
all the callers because whenever the retval is used, it'll be ultimately
added into the vma->vm_start anyway, so it's better.

Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 243b1f2d Fri Dec 16 08:50:52 MST 2022 Peter Xu <peterx@redhat.com> mm/hugetlb: let vma_offset_start() to return start

Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
unshare", v4.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.

huge_pte_offset() is always called with mmap lock held with either read or
write. It was assumed to be safe but it's actually not. One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).

The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.

There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):

======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>

#define MSIZE (1UL << 30) /* 1GB */
#define PSIZE (2UL << 20) /* 2MB */

#define HOLD_SEC (1)

int pipefd[2];
void *buf;

void *do_map(int fd)
{
unsigned char *tmpbuf, *p;
int ret;

ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
if (ret) {
perror("posix_memalign() failed");
return NULL;
}

tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_FIXED, fd, 0);
if (tmpbuf == MAP_FAILED) {
perror("mmap() failed");
return NULL;
}
printf("mmap() -> %p\n", tmpbuf);

for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
*p = 1;
}

return tmpbuf;
}

void do_unmap(void *buf)
{
munmap(buf, MSIZE);
}

void proc2(int fd)
{
unsigned char c;

buf = do_map(fd);
if (!buf)
return;

read(pipefd[0], &c, 1);
/*
* This frees the shared pgtable page, causing use-after-free in
* proc1_thread1 when soft walking hugetlb pgtable.
*/
do_unmap(buf);

printf("Proc2 quitting\n");
}

void *proc1_thread1(void *data)
{
/*
* Trigger follow-page on 1st 2m page. Kernel hack patch needed to
* withhold this procedure for easier reproduce.
*/
madvise(buf, PSIZE, MADV_POPULATE_WRITE);
printf("Proc1-thread1 quitting\n");
return NULL;
}

void *proc1_thread2(void *data)
{
unsigned char c;

/* Wait a while until proc1_thread1() start to wait */
sleep(0.5);
/* Trigger pmd unshare */
madvise(buf, PSIZE, MADV_DONTNEED);
/* Kick off proc2 to release the pgtable */
write(pipefd[1], &c, 1);

printf("Proc1-thread2 quitting\n");
return NULL;
}

void proc1(int fd)
{
pthread_t tid1, tid2;
int ret;

buf = do_map(fd);
if (!buf)
return;

ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
assert(ret == 0);
ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
assert(ret == 0);

/* Kick the child to share the PUD entry */
pthread_join(tid1, NULL);
pthread_join(tid2, NULL);

do_unmap(buf);
}

int main(void)
{
int fd, ret;

fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
if (fd < 0) {
perror("open failed");
return -1;
}

ret = ftruncate(fd, MSIZE);
if (ret) {
perror("ftruncate() failed");
return -1;
}

ret = pipe(pipefd);
if (ret) {
perror("pipe() failed");
return -1;
}

if (fork()) {
proc1(fd);
} else {
proc2(fd);
}

close(pipefd[0]);
close(pipefd[1]);
close(fd);

return 0;
}
======8<=======

The kernel patch needed to present such a race so it'll trigger 100%:

======8<=======
: diff --git a/mm/hugetlb.c b/mm/hugetlb.c
: index 9d97c9a2a15d..f8d99dad5004 100644
: --- a/mm/hugetlb.c
: +++ b/mm/hugetlb.c
: @@ -38,6 +38,7 @@
: #include <asm/page.h>
: #include <asm/pgalloc.h>
: #include <asm/tlb.h>
: +#include <asm/delay.h>
:
: #include <linux/io.h>
: #include <linux/hugetlb.h>
: @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: bool unshare = false;
: int absent;
: struct page *page;
: + unsigned long c = 0;
:
: /*
: * If we have a pending SIGKILL, don't keep faulting pages and
: @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: */
: pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
: huge_page_size(h));
: +
: + pr_info("%s: withhold 1 sec...\n", __func__);
: + for (c = 0; c < 100; c++) {
: + udelay(10000);
: + }
: + pr_info("%s: withhold 1 sec...done\n", __func__);
: +
: if (pte)
: ptl = huge_pte_lock(h, mm, pte);
: absent = !pte || huge_pte_none(huge_ptep_get(pte));
: ======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[ 16.959907] follow_hugetlb_page: withhold 1 sec...
[ 17.960315] follow_hugetlb_page: withhold 1 sec...done
[ 17.960550] ------------[ cut here ]------------
[ 17.960742] DEBUG_LOCKS_WARN_ON(1)
[ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[ 17.961264] Modules linked in:
[ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.965956] PKRU: 55555554
[ 17.966068] Call Trace:
[ 17.966172] <TASK>
[ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.966455] lock_acquire+0xbf/0x2b0
[ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.966799] ? _printk+0x48/0x4e
[ 17.966934] _raw_spin_lock+0x2f/0x40
[ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967473] __get_user_pages+0xbb/0x620
[ 17.967635] faultin_vma_page_range+0x9a/0x100
[ 17.967817] madvise_vma_behavior+0x3c0/0xbd0
[ 17.967998] ? mas_prev+0x11/0x290
[ 17.968141] ? find_vma_prev+0x5e/0xa0
[ 17.968304] ? madvise_vma_anon_name+0x70/0x70
[ 17.968486] madvise_walk_vmas+0xa9/0x120
[ 17.968650] do_madvise.part.0+0xfa/0x270
[ 17.968813] __x64_sys_madvise+0x5a/0x70
[ 17.968974] do_syscall_64+0x37/0x90
[ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 17.969329] RIP: 0033:0x7f1840f0efdb
[ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[ 17.972083] </TASK>
[ 17.972199] irq event stamp: 2353
[ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.974341] ---[ end trace 0000000000000000 ]---
[ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[ 17.975012] #PF: supervisor read access in kernel mode
[ 17.975314] #PF: error_code(0x0000) - not-present page
[ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46
[ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.986674] PKRU: 55555554
[ 17.986832] Call Trace:
[ 17.987012] <TASK>
[ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.987770] lock_acquire+0xbf/0x2b0
[ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.988575] ? _printk+0x48/0x4e
[ 17.988889] _raw_spin_lock+0x2f/0x40
[ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.990119] __get_user_pages+0xbb/0x620
[ 17.990500] faultin_vma_page_range+0x9a/0x100
[ 17.990928] madvise_vma_behavior+0x3c0/0xbd0
[ 17.991354] ? mas_prev+0x11/0x290
[ 17.991678] ? find_vma_prev+0x5e/0xa0
[ 17.992024] ? madvise_vma_anon_name+0x70/0x70
[ 17.992421] madvise_walk_vmas+0xa9/0x120
[ 17.992793] do_madvise.part.0+0xfa/0x270
[ 17.993166] __x64_sys_madvise+0x5a/0x70
[ 17.993539] do_syscall_64+0x37/0x90
[ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.

Patch Layout
============

Patch 1-2: cleanup, or dependency of the follow up patches
Patch 3: before fixing, document huge_pte_offset() on lock required
Patch 4-8: each patch resolves one possible race condition
Patch 9: introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

The series is verified with the above reproducer so the race cannot
trigger anymore. It also passes all hugetlb kselftests.


This patch (of 9):

Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should use to offset
the vma->vm_start address.

Make it return the real value of the start vaddr, and it also helps for
all the callers because whenever the retval is used, it'll be ultimately
added into the vma->vm_start anyway, so it's better.

Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 243b1f2d Fri Dec 16 08:50:52 MST 2022 Peter Xu <peterx@redhat.com> mm/hugetlb: let vma_offset_start() to return start

Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
unshare", v4.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.

huge_pte_offset() is always called with mmap lock held with either read or
write. It was assumed to be safe but it's actually not. One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).

The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.

There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):

======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>

#define MSIZE (1UL << 30) /* 1GB */
#define PSIZE (2UL << 20) /* 2MB */

#define HOLD_SEC (1)

int pipefd[2];
void *buf;

void *do_map(int fd)
{
unsigned char *tmpbuf, *p;
int ret;

ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
if (ret) {
perror("posix_memalign() failed");
return NULL;
}

tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_FIXED, fd, 0);
if (tmpbuf == MAP_FAILED) {
perror("mmap() failed");
return NULL;
}
printf("mmap() -> %p\n", tmpbuf);

for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
*p = 1;
}

return tmpbuf;
}

void do_unmap(void *buf)
{
munmap(buf, MSIZE);
}

void proc2(int fd)
{
unsigned char c;

buf = do_map(fd);
if (!buf)
return;

read(pipefd[0], &c, 1);
/*
* This frees the shared pgtable page, causing use-after-free in
* proc1_thread1 when soft walking hugetlb pgtable.
*/
do_unmap(buf);

printf("Proc2 quitting\n");
}

void *proc1_thread1(void *data)
{
/*
* Trigger follow-page on 1st 2m page. Kernel hack patch needed to
* withhold this procedure for easier reproduce.
*/
madvise(buf, PSIZE, MADV_POPULATE_WRITE);
printf("Proc1-thread1 quitting\n");
return NULL;
}

void *proc1_thread2(void *data)
{
unsigned char c;

/* Wait a while until proc1_thread1() start to wait */
sleep(0.5);
/* Trigger pmd unshare */
madvise(buf, PSIZE, MADV_DONTNEED);
/* Kick off proc2 to release the pgtable */
write(pipefd[1], &c, 1);

printf("Proc1-thread2 quitting\n");
return NULL;
}

void proc1(int fd)
{
pthread_t tid1, tid2;
int ret;

buf = do_map(fd);
if (!buf)
return;

ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
assert(ret == 0);
ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
assert(ret == 0);

/* Kick the child to share the PUD entry */
pthread_join(tid1, NULL);
pthread_join(tid2, NULL);

do_unmap(buf);
}

int main(void)
{
int fd, ret;

fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
if (fd < 0) {
perror("open failed");
return -1;
}

ret = ftruncate(fd, MSIZE);
if (ret) {
perror("ftruncate() failed");
return -1;
}

ret = pipe(pipefd);
if (ret) {
perror("pipe() failed");
return -1;
}

if (fork()) {
proc1(fd);
} else {
proc2(fd);
}

close(pipefd[0]);
close(pipefd[1]);
close(fd);

return 0;
}
======8<=======

The kernel patch needed to present such a race so it'll trigger 100%:

======8<=======
: diff --git a/mm/hugetlb.c b/mm/hugetlb.c
: index 9d97c9a2a15d..f8d99dad5004 100644
: --- a/mm/hugetlb.c
: +++ b/mm/hugetlb.c
: @@ -38,6 +38,7 @@
: #include <asm/page.h>
: #include <asm/pgalloc.h>
: #include <asm/tlb.h>
: +#include <asm/delay.h>
:
: #include <linux/io.h>
: #include <linux/hugetlb.h>
: @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: bool unshare = false;
: int absent;
: struct page *page;
: + unsigned long c = 0;
:
: /*
: * If we have a pending SIGKILL, don't keep faulting pages and
: @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
: */
: pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
: huge_page_size(h));
: +
: + pr_info("%s: withhold 1 sec...\n", __func__);
: + for (c = 0; c < 100; c++) {
: + udelay(10000);
: + }
: + pr_info("%s: withhold 1 sec...done\n", __func__);
: +
: if (pte)
: ptl = huge_pte_lock(h, mm, pte);
: absent = !pte || huge_pte_none(huge_ptep_get(pte));
: ======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[ 16.959907] follow_hugetlb_page: withhold 1 sec...
[ 17.960315] follow_hugetlb_page: withhold 1 sec...done
[ 17.960550] ------------[ cut here ]------------
[ 17.960742] DEBUG_LOCKS_WARN_ON(1)
[ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[ 17.961264] Modules linked in:
[ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.965956] PKRU: 55555554
[ 17.966068] Call Trace:
[ 17.966172] <TASK>
[ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.966455] lock_acquire+0xbf/0x2b0
[ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.966799] ? _printk+0x48/0x4e
[ 17.966934] _raw_spin_lock+0x2f/0x40
[ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967473] __get_user_pages+0xbb/0x620
[ 17.967635] faultin_vma_page_range+0x9a/0x100
[ 17.967817] madvise_vma_behavior+0x3c0/0xbd0
[ 17.967998] ? mas_prev+0x11/0x290
[ 17.968141] ? find_vma_prev+0x5e/0xa0
[ 17.968304] ? madvise_vma_anon_name+0x70/0x70
[ 17.968486] madvise_walk_vmas+0xa9/0x120
[ 17.968650] do_madvise.part.0+0xfa/0x270
[ 17.968813] __x64_sys_madvise+0x5a/0x70
[ 17.968974] do_syscall_64+0x37/0x90
[ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 17.969329] RIP: 0033:0x7f1840f0efdb
[ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[ 17.972083] </TASK>
[ 17.972199] irq event stamp: 2353
[ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.974341] ---[ end trace 0000000000000000 ]---
[ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[ 17.975012] #PF: supervisor read access in kernel mode
[ 17.975314] #PF: error_code(0x0000) - not-present page
[ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46
[ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.986674] PKRU: 55555554
[ 17.986832] Call Trace:
[ 17.987012] <TASK>
[ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.987770] lock_acquire+0xbf/0x2b0
[ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.988575] ? _printk+0x48/0x4e
[ 17.988889] _raw_spin_lock+0x2f/0x40
[ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.990119] __get_user_pages+0xbb/0x620
[ 17.990500] faultin_vma_page_range+0x9a/0x100
[ 17.990928] madvise_vma_behavior+0x3c0/0xbd0
[ 17.991354] ? mas_prev+0x11/0x290
[ 17.991678] ? find_vma_prev+0x5e/0xa0
[ 17.992024] ? madvise_vma_anon_name+0x70/0x70
[ 17.992421] madvise_walk_vmas+0xa9/0x120
[ 17.992793] do_madvise.part.0+0xfa/0x270
[ 17.993166] __x64_sys_madvise+0x5a/0x70
[ 17.993539] do_syscall_64+0x37/0x90
[ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.

Patch Layout
============

Patch 1-2: cleanup, or dependency of the follow up patches
Patch 3: before fixing, document huge_pte_offset() on lock required
Patch 4-8: each patch resolves one possible race condition
Patch 9: introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

The series is verified with the above reproducer so the race cannot
trigger anymore. It also passes all hugetlb kselftests.


This patch (of 9):

Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should use to offset
the vma->vm_start address.

Make it return the real value of the start vaddr, and it also helps for
all the callers because whenever the retval is used, it'll be ultimately
added into the vma->vm_start anyway, so it's better.

Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
/linux-master/fs/fuse/
H A Dfile.cdiff 4a90451b Fri Feb 09 08:14:50 MST 2024 Amir Goldstein <amir73il@gmail.com> fuse: implement open in passthrough mode

After getting a backing file id with FUSE_DEV_IOC_BACKING_OPEN ioctl,
a FUSE server can reply to an OPEN request with flag FOPEN_PASSTHROUGH
and the backing file id.

The FUSE server should reuse the same backing file id for all the open
replies of the same FUSE inode and open will fail (with -EIO) if a the
server attempts to open the same inode with conflicting io modes or to
setup passthrough to two different backing files for the same FUSE inode.
Using the same backing file id for several different inodes is allowed.

Opening a new file with FOPEN_DIRECT_IO for an inode that is already
open for passthrough is allowed, but only if the FOPEN_PASSTHROUGH flag
and correct backing file id are specified as well.

The read/write IO of such files will not use passthrough operations to
the backing file, but mmap, which does not support direct_io, will use
the backing file insead of using the page cache as it always did.

Even though all FUSE passthrough files of the same inode use the same
backing file as a backing inode reference, each FUSE file opens a unique
instance of a backing_file object to store the FUSE path that was used
to open the inode and the open flags of the specific open file.

The per-file, backing_file object is released along with the FUSE file.
The inode associated fuse_backing object is released when the last FUSE
passthrough file of that inode is released AND when the backing file id
is closed by the server using the FUSE_DEV_IOC_BACKING_CLOSE ioctl.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 91ec6c85 Mon Aug 14 05:05:30 MDT 2023 Miklos Szeredi <mszeredi@redhat.com> Revert "fuse: in fuse_flush only wait if someone wants the return code"

This reverts commit 5a8bee63b10f6f2f52f6d22e109a4a147409842a.

Jürg Billeter reports the following regression:

Since v6.3-rc1 commit 5a8bee63b1 ("fuse: in fuse_flush only wait if
someone wants the return code") `fput()` is called asynchronously if a
file is closed as part of a process exiting, i.e., if there was no
explicit `close()` before exit.

If the file was open for writing, also `put_write_access()` is called
asynchronously as part of the async `fput()`.

If that newly written file is an executable, attempting to `execve()` the
new file can fail with `ETXTBSY` if it's called after the writer process
exited but before the async `fput()` has run.

Reported-and-tested-by: "Jürg Billeter" <j@bitron.ch>
Cc: <stable@vger.kernel.org> # v6.3
Link: https://lore.kernel.org/all/4f66cded234462964899f2a661750d6798a57ec0.camel@bitron.ch/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 44361e8c Wed Nov 23 01:10:42 MST 2022 Miklos Szeredi <mszeredi@redhat.com> fuse: lock inode unconditionally in fuse_fallocate()

file_modified() must be called with inode lock held. fuse_fallocate()
didn't lock the inode in case of just FALLOC_KEEP_SIZE flags value, which
resulted in a kernel Warning in notify_change().

Lock the inode unconditionally, like all other fallocate implementations
do.

Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Reported-and-tested-by: syzbot+462da39f0667b357c4b6@syzkaller.appspotmail.com
Fixes: 4a6f278d4827 ("fuse: add file_modified() to fallocate")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 4a6f278d Fri Oct 28 06:25:20 MDT 2022 Miklos Szeredi <mszeredi@redhat.com> fuse: add file_modified() to fallocate

Add missing file_modified() call to fuse_file_fallocate(). Without this
fallocate on fuse failed to clear privileges.

Fixes: 05ba1f082300 ("fuse: add FALLOCATE operation")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 91b94c5d Sun May 22 07:39:27 MDT 2022 Al Viro <viro@zeniv.linux.org.uk> iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC

New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb). Checks converted, which allows to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
from iocb_flags() - it's checked in iocb_is_dsync() instead

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 4f06dd92 Wed Oct 21 14:12:49 MDT 2020 Vivek Goyal <vgoyal@redhat.com> fuse: fix write deadlock

There are two modes for write(2) and friends in fuse:

a) write through (update page cache, send sync WRITE request to userspace)

b) buffered write (update page cache, async writeout later)

The write through method kept all the page cache pages locked that were
used for the request. Keeping more than one page locked is deadlock prone
and Qian Cai demonstrated this with trinity fuzzing.

The reason for keeping the pages locked is that concurrent mapped reads
shouldn't try to pull possibly stale data into the page cache.

For full page writes, the easy way to fix this is to make the cached page
be the authoritative source by marking the page PG_uptodate immediately.
After this the page can be safely unlocked, since mapped/cached reads will
take the written data from the cache.

Concurrent mapped writes will now cause data in the original WRITE request
to be updated; this however doesn't cause any data inconsistency and this
scenario should be exceedingly rare anyway.

If the WRITE request returns with an error in the above case, currently the
page is not marked uptodate; this means that a concurrent read will always
read consistent data. After this patch the page is uptodate between
writing to the cache and receiving the error: there's window where a cached
read will read the wrong data. While theoretically this could be a
regression, it is unlikely to be one in practice, since this is normal for
buffered writes.

In case of a partial page write to an already uptodate page the locking is
also unnecessary, with the above caveats.

Partial write of a not uptodate page still needs to be handled. One way
would be to read the complete page before doing the write. This is not
possible, since it might break filesystems that don't expect any READ
requests when the file was opened O_WRONLY.

The other solution is to serialize the synchronous write with reads from
the partial pages. The easiest way to do this is to keep the partial pages
locked. The problem is that a write() may involve two such pages (one head
and one tail). This patch fixes it by only locking the partial tail page.
If there's a partial head page as well, then split that off as a separate
WRITE request.

Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-fsdevel/4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw/
Fixes: ea9b9907b82a ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org> # v2.6.26
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 6ae330ca Wed Aug 19 16:19:54 MDT 2020 Vivek Goyal <vgoyal@redhat.com> virtiofs: serialize truncate/punch_hole and dax fault path

Currently in fuse we don't seem have any lock which can serialize fault
path with truncate/punch_hole path. With dax support I need one for
following reasons.

1. Dax requirement

DAX fault code relies on inode size being stable for the duration of
fault and want to serialize with truncate/punch_hole and they explicitly
mention it.

static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
/*
* Check whether offset isn't beyond end of file now. Caller is
* supposed to hold locks serializing us with truncate / punch hole so
* this is a reliable test.
*/
max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);

2. Make sure there are no users of pages being truncated/punch_hole

get_user_pages() might take references to page and then do some DMA
to said pages. Filesystem might truncate those pages without knowing
that a DMA is in progress or some I/O is in progress. So use
dax_layout_busy_page() to make sure there are no such references
and I/O is not in progress on said pages before moving ahead with
truncation.

3. Limitation of kvm page fault error reporting

If we are truncating file on host first and then removing mappings in
guest lateter (truncate page cache etc), then this could lead to a
problem with KVM. Say a mapping is in place in guest and truncation
happens on host. Now if guest accesses that mapping, then host will
take a fault and kvm will either exit to qemu or spin infinitely.

IOW, before we do truncation on host, we need to make sure that guest
inode does not have any mapping in that region or whole file.

4. virtiofs memory range reclaim

Soon I will introduce the notion of being able to reclaim dax memory
ranges from a fuse dax inode. There also I need to make sure that
no I/O or fault is going on in the reclaimed range and nobody is using
it so that range can be reclaimed without issues.

Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose. It can be used to serialize with faults.

As of now, I am adding taking this semaphore only in dax fault path and
not regular fault path because existing code does not have one. May
be existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am just focussing
only on DAX path which is new path.

Also added logic to take fuse_inode->i_mmap_sem in
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exlusive and avoid all the above problems.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
diff cf576c58 Mon May 11 20:29:04 MDT 2020 Eryu Guan <eguan@linux.alibaba.com> fuse: invalidate inode attr in writeback cache mode

Under writeback mode, inode->i_blocks is not updated, making utils du
read st.blocks as 0.

For example, when using virtiofs (cache=always & nondax mode) with
writeback_cache enabled, writing a new file and check its disk usage
with du, du reports 0 usage.

# uname -r
5.6.0-rc6+
# mount -t virtiofs virtiofs /mnt/virtiofs
# rm -f /mnt/virtiofs/testfile

# create new file and do extend write
# xfs_io -fc "pwrite 0 4k" /mnt/virtiofs/testfile
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0001 sec (28.103 MiB/sec and 7194.2446 ops/sec)
# du -k /mnt/virtiofs/testfile
0 <==== disk usage is 0
# stat -c %s,%b /mnt/virtiofs/testfile
4096,0 <==== i_size is correct, but st_blocks is 0

Fix it by invalidating attr in fuse_flush(), so we get up-to-date attr
from server on next getattr.

Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff cf576c58 Mon May 11 20:29:04 MDT 2020 Eryu Guan <eguan@linux.alibaba.com> fuse: invalidate inode attr in writeback cache mode

Under writeback mode, inode->i_blocks is not updated, making utils du
read st.blocks as 0.

For example, when using virtiofs (cache=always & nondax mode) with
writeback_cache enabled, writing a new file and check its disk usage
with du, du reports 0 usage.

# uname -r
5.6.0-rc6+
# mount -t virtiofs virtiofs /mnt/virtiofs
# rm -f /mnt/virtiofs/testfile

# create new file and do extend write
# xfs_io -fc "pwrite 0 4k" /mnt/virtiofs/testfile
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0001 sec (28.103 MiB/sec and 7194.2446 ops/sec)
# du -k /mnt/virtiofs/testfile
0 <==== disk usage is 0
# stat -c %s,%b /mnt/virtiofs/testfile
4096,0 <==== i_size is correct, but st_blocks is 0

Fix it by invalidating attr in fuse_flush(), so we get up-to-date attr
from server on next getattr.

Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
/linux-master/fs/nfs/
H A Ddir.cdiff 10a973fc Wed Sep 27 19:50:25 MDT 2023 Al Viro <viro@zeniv.linux.org.uk> nfs: make nfs_set_verifier() safe for use in RCU pathwalk

nfs_set_verifier() relies upon dentry being pinned; if that's
the case, grabbing ->d_lock stabilizes ->d_parent and guarantees
that ->d_parent points to a positive dentry. For something
we'd run into in RCU mode that is *not* true - dentry might've
been through dentry_kill() just as we grabbed ->d_lock, with
its parent going through the same just as we get to into
nfs_set_verifier_locked(). It might get to detaching inode
(and zeroing ->d_inode) before nfs_set_verifier_locked() gets
to fetching that; we get an oops as the result.

That can happen in nfs{,4} ->d_revalidate(); the call chain in
question is nfs_set_verifier_locked() <- nfs_set_verifier() <-
nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate().
We have checked that the parent had been positive, but that's
done before we get to nfs_set_verifier() and it's possible for
memory pressure to pick our dentry as eviction candidate by that
time. If that happens, back-to-back attempts to kill dentry and
its parent are quite normal. Sure, in case of eviction we'll
fail the ->d_seq check in the caller, but we need to survive
until we return there...

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 10a973fc Wed Sep 27 19:50:25 MDT 2023 Al Viro <viro@zeniv.linux.org.uk> nfs: make nfs_set_verifier() safe for use in RCU pathwalk

nfs_set_verifier() relies upon dentry being pinned; if that's
the case, grabbing ->d_lock stabilizes ->d_parent and guarantees
that ->d_parent points to a positive dentry. For something
we'd run into in RCU mode that is *not* true - dentry might've
been through dentry_kill() just as we grabbed ->d_lock, with
its parent going through the same just as we get to into
nfs_set_verifier_locked(). It might get to detaching inode
(and zeroing ->d_inode) before nfs_set_verifier_locked() gets
to fetching that; we get an oops as the result.

That can happen in nfs{,4} ->d_revalidate(); the call chain in
question is nfs_set_verifier_locked() <- nfs_set_verifier() <-
nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate().
We have checked that the parent had been positive, but that's
done before we get to nfs_set_verifier() and it's possible for
memory pressure to pick our dentry as eviction candidate by that
time. If that happens, back-to-back attempts to kill dentry and
its parent are quite normal. Sure, in case of eviction we'll
fail the ->d_seq check in the caller, but we need to survive
until we return there...

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 4b71e241 Wed May 03 11:24:11 MDT 2023 Fabio M. De Francesco <fmdefrancesco@gmail.com> NFS: Convert kmap_atomic() to kmap_local_folio()

kmap_atomic() is deprecated in favor of kmap_local_{folio,page}().

Therefore, replace kmap_atomic() with kmap_local_folio() in
nfs_readdir_folio_array_append().

kmap_atomic() disables page-faults and preemption (the latter only for
!PREEMPT_RT kernels), However, the code within the mapping/un-mapping in
nfs_readdir_folio_array_append() does not depend on the above-mentioned
side effects.

Therefore, a mere replacement of the old API with the new one is all that
is required (i.e., there is no need to explicitly add any calls to
pagefault_disable() and/or preempt_disable()).

Tested with (x)fstests in a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.

Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Fixes: ec108d3cc766 ("NFS: Convert readdir page array functions to use a folio")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff b243874f Tue Mar 29 05:32:08 MDT 2022 ChenXiaoSong <chenxiaosong2@huawei.com> NFSv4: fix open failure with O_ACCMODE flag

open() with O_ACCMODE|O_DIRECT flags secondly will fail.

Reproducer:
1. mount -t nfs -o vers=4.2 $server_ip:/ /mnt/
2. fd = open("/mnt/file", O_ACCMODE|O_DIRECT|O_CREAT)
3. close(fd)
4. fd = open("/mnt/file", O_ACCMODE|O_DIRECT)

Server nfsd4_decode_share_access() will fail with error nfserr_bad_xdr when
client use incorrect share access mode of 0.

Fix this by using NFS4_SHARE_ACCESS_BOTH share access mode in client,
just like firstly opening.

Fixes: ce4ef7c0a8a05 ("NFS: Split out NFS v4 file operations")
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
diff 9f01eb5d Mon Dec 09 22:46:39 MST 2019 Madhuparna Bhowmik <madhuparnabhowmik04@gmail.com> nfs: Fix nfs_access_get_cached_rcu() sparse error

This patch fixes the following sparse error:
fs/nfs/dir.c:2353:14: error: incompatible types in comparison expression (different address spaces):
fs/nfs/dir.c:2353:14: struct list_head [noderef] <asn:4> *
fs/nfs/dir.c:2353:14: struct list_head *

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik04@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 4b310319 Sun Feb 02 15:53:53 MST 2020 Trond Myklebust <trondmy@gmail.com> NFS: Fix memory leaks and corruption in readdir

nfs_readdir_xdr_to_array() must not exit without having initialised
the array, so that the page cache deletion routines can safely
call nfs_readdir_clear_array().
Furthermore, we should ensure that if we exit nfs_readdir_filler()
with an error, we free up any page contents to prevent a leak
if we try to fill the page again.

Fixes: 11de3b11e08c ("NFS: Fix a memory leak in nfs_readdir")
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff db531db9 Fri Jul 12 08:18:06 MDT 2019 Max Kellermann <mk@cm4all.com> Revert "NFS: readdirplus optimization by cache mechanism" (memleak)

This reverts commit be4c2d4723a4a637f0d1b4f7c66447141a4b3564.

That commit caused a severe memory leak in nfs_readdir_make_qstr().

When listing a directory with more than 100 files (this is how many
struct nfs_cache_array_entry elements fit in one 4kB page), all
allocated file name strings past those 100 leak.

The root of the leakage is that those string pointers are managed in
pages which are never linked into the page cache.

fs/nfs/dir.c puts pages into the page cache by calling
read_cache_page(); the callback function nfs_readdir_filler() will
then fill the given page struct which was passed to it, which is
already linked in the page cache (by do_read_cache_page() calling
add_to_page_cache_lru()).

Commit be4c2d4723a4 added another (local) array of allocated pages, to
be filled with more data, instead of discarding excess items received
from the NFS server. Those additional pages can be used by the next
nfs_readdir_filler() call (from within the same nfs_readdir() call).

The leak happens when some of those additional pages are never used
(copied to the page cache using copy_highpage()). The pages will be
freed by nfs_readdir_free_pages(), but their contents will not. The
commit did not invoke nfs_readdir_clear_array() (and doing so would
have been dangerous, because it did not track which of those pages
were already copied to the page cache, risking double free bugs).

How to reproduce the leak:

- Use a kernel with CONFIG_SLUB_DEBUG_ON.

- Create a directory on a NFS mount with more than 100 files with
names long enough to use the "kmalloc-32" slab (so we can easily
look up the allocation counts):

for i in `seq 110`; do touch ${i}_0123456789abcdef; done

- Drop all caches:

echo 3 >/proc/sys/vm/drop_caches

- Check the allocation counter:

grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
30564391 nfs_readdir_add_to_array+0x73/0xd0 age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1

- Request a directory listing and check the allocation counters again:

ls
[...]
grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
30564511 nfs_readdir_add_to_array+0x73/0xd0 age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1

There are now 120 new allocations.

- Drop all caches and check the counters again:

echo 3 >/proc/sys/vm/drop_caches
grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
30564401 nfs_readdir_add_to_array+0x73/0xd0 age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1

110 allocations are gone, but 10 have leaked and will never be freed.

Unhelpfully, those allocations are explicitly excluded from KMEMLEAK,
that's why my initial attempts with KMEMLEAK were not successful:

/*
* Avoid a kmemleak false positive. The pointer to the name is stored
* in a page cache page which kmemleak does not scan.
*/
kmemleak_not_leak(string->name);

It would be possible to solve this bug without reverting the whole
commit:

- keep track of which pages were not used, and call
nfs_readdir_clear_array() on them, or
- manually link those pages into the page cache

But for now I have decided to just revert the commit, because the real
fix would require complex considerations, risking more dangerous
(crash) bugs, which may seem unsuitable for the stable branches.

Signed-off-by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
diff cc89684c Tue Jul 04 08:22:20 MDT 2017 NeilBrown <neilb@suse.com> NFS: only invalidate dentrys that are clearly invalid.

Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate")
in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
to be invalidated even if it has filesystems mounted on or it or on a
descendant. The mounted filesystem is unmounted.

This means we need to be careful not to return 0 unless the directory
referred to truly is invalid. So -ESTALE or -ENOENT should invalidate
the directory. Other errors such a -EPERM or -ERESTARTSYS should be
returned from ->d_revalidate() so they are propagated to the caller.

A particular problem can be demonstrated by:

1/ mount an NFS filesystem using NFSv3 on /mnt
2/ mount any other filesystem on /mnt/foo
3/ ls /mnt/foo
4/ turn off network, or otherwise make the server unable to respond
5/ ls /mnt/foo &
6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
7/ kill -9 $! # this results in -ERESTARTSYS being returned
8/ observe that /mnt/foo has been unmounted.

This patch changes nfs_lookup_revalidate() to only treat
-ESTALE from nfs_lookup_verify_inode() and
-ESTALE or -ENOENT from ->lookup()
as indicating an invalid inode. Other errors are returned.

Also nfs_check_inode_attributes() is changed to return -ESTALE rather
than -EIO. This is consistent with the error returned in similar
circumstances from nfs_update_inode().

As this bug allows any user to unmount a filesystem mounted on an NFS
filesystem, this fix is suitable for stable kernels.

Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
Cc: stable@vger.kernel.org (v3.18+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff be62a1a8 Sat Mar 26 14:14:39 MDT 2016 Miklos Szeredi <mszeredi@redhat.com> nfs: use file_dentry()

NFS may be used as lower layer of overlayfs and accessing f_path.dentry can
lead to a crash.

Fix by replacing direct access of file->f_path.dentry with the
file_dentry() accessor, which will always return a native object.

Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org> # v4.2
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
diff f682a398 Sun Jul 13 19:28:20 MDT 2014 NeilBrown <neilb@suse.de> NFS: allow lockless access to access_cache

The access cache is used during RCU-walk path lookups, so it is best
to avoid locking if possible as taking a lock kills concurrency.

The rbtree is not rcu-safe and cannot easily be made so.
Instead we simply check the last (i.e. most recent) entry on the LRU
list. If this doesn't match, then we return -ECHILD and retry in
lock/refcount mode.

This requires freeing the nfs_access_entry struct with rcu, and
requires using rcu access primatives when adding entries to the lru, and
when examining the last entry.

Calling put_rpccred before kfree_rcu looks a bit odd, but as
put_rpccred already provides rcu protection, we know that the cred will
not actually be freed until the next grace period, so any concurrent
access will be safe.

This patch provides about 5% performance improvement on a stat-heavy
synthetic work load with 4 threads on a 2-core CPU.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
/linux-master/include/linux/
H A Dhugetlb.hdiff b78b27d0 Thu Feb 22 07:04:21 MST 2024 Gang Li <gang.li@linux.dev> hugetlb: parallelize 1G hugetlb initialization

Optimizing the initialization speed of 1G huge pages through
parallelization.

1G hugetlbs are allocated from bootmem, a process that is already very
fast and does not currently require optimization. Therefore, we focus on
parallelizing only the initialization phase in `gather_bootmem_prealloc`.

Here are some test results:
test case no patch(ms) patched(ms) saved
------------------- -------------- ------------- --------
256c2T(4 node) 1G 4745 2024 57.34%
128c1T(2 node) 1G 3358 1712 49.02%
12T 1G 77000 18300 76.23%

[akpm@linux-foundation.org: s/initialied/initialized/, per Alexey]
Link: https://lkml.kernel.org/r/20240222140422.393911-9-gang.li@linux.dev
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 2820b0f0 Thu Oct 05 21:59:08 MDT 2023 Rik van Riel <riel@surriel.com> hugetlbfs: close race between MADV_DONTNEED and page fault

Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.

This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.

Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault. However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.

Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:

CPU 1 CPU 2

MADV_DONTNEED
unmap page
shoot down TLB entry
page fault
fail to allocate a huge page
killed with SIGBUS
free page

Fix that race by pulling the locking from __unmap_hugepage_final_range
into helper functions called from zap_page_range_single. This ensures
page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
have actually been freed.

Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com
Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 2820b0f0 Thu Oct 05 21:59:08 MDT 2023 Rik van Riel <riel@surriel.com> hugetlbfs: close race between MADV_DONTNEED and page fault

Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.

This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.

Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault. However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.

Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:

CPU 1 CPU 2

MADV_DONTNEED
unmap page
shoot down TLB entry
page fault
fail to allocate a huge page
killed with SIGBUS
free page

Fix that race by pulling the locking from __unmap_hugepage_final_range
into helper functions called from zap_page_range_single. This ensures
page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
have actually been freed.

Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com
Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 454a00c4 Wed Aug 16 09:11:51 MDT 2023 Matthew Wilcox (Oracle) <willy@infradead.org> mm: convert free_huge_page() to free_huge_folio()

Pass a folio instead of the head page to save a few instructions. Update
the documentation, at least in English.

Link: https://lkml.kernel.org/r/20230816151201.3655946-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 5502ea44 Wed Jun 28 15:53:05 MDT 2023 Peter Xu <peterx@redhat.com> mm/hugetlb: add page_mask for hugetlb_follow_page_mask()

follow_page() doesn't need it, but we'll start to need it when unifying
gup for hugetlb.

Link: https://lkml.kernel.org/r/20230628215310.73782-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff c33c7948 Mon Jun 12 09:15:45 MDT 2023 Ryan Roberts <ryan.roberts@arm.com> mm: ptep_get() conversion

Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----

// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)

----

Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.

Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.

Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff d9712937 Tue Mar 14 16:12:49 MDT 2023 Axel Rasmussen <axelrasmussen@google.com> mm: userfaultfd: combine 'mode' and 'wp_copy' arguments

Many userfaultfd ioctl functions take both a 'mode' and a 'wp_copy'
argument. In future commits we plan to plumb the flags through to more
places, so we'd be proliferating the very long argument list even further.

Let's take the time to simplify the argument list. Combine the two
arguments into one - and generalize, so when we add more flags in the
future, it doesn't imply more function arguments.

Since the modes (copy, zeropage, continue) are mutually exclusive, store
them as an integer value (0, 1, 2) in the low bits. Place combine-able
flag bits in the high bits.

This is quite similar to an earlier patch proposed by Nadav Amit
("userfaultfd: introduce uffd_flags" [1]). The main difference is that
patch only handled flags, whereas this patch *also* combines the "mode"
argument into the same type to shorten the argument list.

[1]: https://lore.kernel.org/all/20220619233449.181323-2-namit@vmware.com/

Link: https://lkml.kernel.org/r/20230314221250.682452-4-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: James Houghton <jthoughton@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a734991c Tue Mar 14 16:12:47 MDT 2023 Axel Rasmussen <axelrasmussen@google.com> mm: userfaultfd: rename functions for clarity + consistency

Patch series "mm: userfaultfd: refactor and add UFFDIO_CONTINUE_MODE_WP",
v5.

- Commits 1-3 refactor userfaultfd ioctl code without behavior changes, with the
main goal of improving consistency and reducing the number of function args.

- Commit 4 adds UFFDIO_CONTINUE_MODE_WP.


This patch (of 4):

The basic problem is, over time we've added new userfaultfd ioctls, and
we've refactored the code so functions which used to handle only one case
are now re-used to deal with several cases. While this happened, we
didn't bother to rename the functions.

Similarly, as we added new functions, we cargo-culted pieces of the
now-inconsistent naming scheme, so those functions too ended up with names
that don't make a lot of sense.

A key point here is, "copy" in most userfaultfd code refers specifically
to UFFDIO_COPY, where we allocate a new page and copy its contents from
userspace. There are many functions with "copy" in the name that don't
actually do this (at least in some cases).

So, rename things into a consistent scheme. The high level idea is that
the call stack for userfaultfd ioctls becomes:

userfaultfd_ioctl
-> userfaultfd_(particular ioctl)
-> mfill_atomic_(particular kind of fill operation)
-> mfill_atomic /* loops over pages in range */
-> mfill_atomic_pte /* deals with single pages */
-> mfill_atomic_pte_(particular kind of fill operation)
-> mfill_atomic_install_pte

There are of course some special cases (shmem, hugetlb), but this is the
general structure which all function names now adhere to.

Link: https://lkml.kernel.org/r/20230314221250.682452-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20230314221250.682452-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a734991c Tue Mar 14 16:12:47 MDT 2023 Axel Rasmussen <axelrasmussen@google.com> mm: userfaultfd: rename functions for clarity + consistency

Patch series "mm: userfaultfd: refactor and add UFFDIO_CONTINUE_MODE_WP",
v5.

- Commits 1-3 refactor userfaultfd ioctl code without behavior changes, with the
main goal of improving consistency and reducing the number of function args.

- Commit 4 adds UFFDIO_CONTINUE_MODE_WP.


This patch (of 4):

The basic problem is, over time we've added new userfaultfd ioctls, and
we've refactored the code so functions which used to handle only one case
are now re-used to deal with several cases. While this happened, we
didn't bother to rename the functions.

Similarly, as we added new functions, we cargo-culted pieces of the
now-inconsistent naming scheme, so those functions too ended up with names
that don't make a lot of sense.

A key point here is, "copy" in most userfaultfd code refers specifically
to UFFDIO_COPY, where we allocate a new page and copy its contents from
userspace. There are many functions with "copy" in the name that don't
actually do this (at least in some cases).

So, rename things into a consistent scheme. The high level idea is that
the call stack for userfaultfd ioctls becomes:

userfaultfd_ioctl
-> userfaultfd_(particular ioctl)
-> mfill_atomic_(particular kind of fill operation)
-> mfill_atomic /* loops over pages in range */
-> mfill_atomic_pte /* deals with single pages */
-> mfill_atomic_pte_(particular kind of fill operation)
-> mfill_atomic_install_pte

There are of course some special cases (shmem, hugetlb), but this is the
general structure which all function names now adhere to.

Link: https://lkml.kernel.org/r/20230314221250.682452-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20230314221250.682452-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff ea8e72f4 Wed Jan 25 10:05:32 MST 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/hugetlb: convert putback_active_hugepage to take in a folio

Convert putback_active_hugepage() to folio_putback_active_hugetlb(), this
removes one user of the Huge Page macros which take in a page. The
callers in migrate.c are also cleaned up by being able to directly use the
src and dst folio variables.

Link: https://lkml.kernel.org/r/20230125170537.96973-4-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
H A Dfs.hdiff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.

While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.

For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

with this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.

While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.

For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

with this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.

While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.

For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

with this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.

While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.

For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

with this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff e4e8b47a Mon Jan 15 10:30:46 MST 2018 Max Kellermann <mk@cm4all.com> fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n

Make IS_POSIXACL() return false if POSIX ACL support is disabled.

Never skip applying the umask in namei.c and never bother to do any
ACL specific checks if the filesystem falsely indicates it has ACLs
enabled when the feature is completely disabled in the kernel.

This fixes a problem where the umask is always ignored in the NFS
client when compiled without CONFIG_FS_POSIX_ACL. This is a 4 year
old regression caused by commit 013cdf1088d723 which itself was not
completely wrong, but failed to consider all the side effects by
misdesigned VFS code.

Prior to that commit, there were two places where the umask could be
applied, for example when creating a directory:

1. in the VFS layer in SYSCALL_DEFINE3(mkdirat), but only if
!IS_POSIXACL()

2. again (unconditionally) in nfs3_proc_mkdir()

The first one does not apply, because even without
CONFIG_FS_POSIX_ACL, the NFS client sets SB_POSIXACL in
nfs_fill_super().

After that commit, (2.) was replaced by:

2b. in posix_acl_create(), called by nfs3_proc_mkdir()

There's one branch in posix_acl_create() which applies the umask;
however, without CONFIG_FS_POSIX_ACL, posix_acl_create() is an empty
dummy function which does not apply the umask.

The approach chosen by this patch is to make IS_POSIXACL() always
return false when POSIX ACL support is disabled, so the umask always
gets applied by the VFS layer. This is consistent with the (regular)
behavior of posix_acl_create(): that function returns early if
IS_POSIXACL() is false, before applying the umask.

Therefore, posix_acl_create() is responsible for applying the umask if
there is ACL support enabled in the file system (SB_POSIXACL), and the
VFS layer is responsible for all other cases (no SB_POSIXACL or no
CONFIG_FS_POSIX_ACL).

Signed-off-by: Max Kellermann <mk@cm4all.com>
Link: https://lore.kernel.org/r/151603744662.29035.4910161264124875658.stgit@rabbit.intern.cm-ag
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff def3ae83 Mon Oct 09 09:37:12 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: store real path instead of fake path in backing file f_path

A backing file struct stores two path's, one "real" path that is referring
to f_inode and one "fake" path, which should be displayed to users in
/proc/<pid>/maps.

There is a lot more potential code that needs to know the "real" path, then
code that needs to know the "fake" path.

Instead of code having to request the "real" path with file_real_path(),
store the "real" path in f_path and require code that needs to know the
"fake" path request it with file_user_path().
Replace the file_real_path() helper with a simple const accessor f_path().

After this change, file_dentry() is not expected to observe any files
with overlayfs f_path and real f_inode, so the call to ->d_real() should
not be needed. Leave the ->d_real() call for now and add an assertion
in ovl_d_real() to catch if we made wrong assumptions.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegtt48eXhhjDFA1ojcHPNKj3Go6joryCPtEFAKpocyBsnw@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231009153712.1566422-4-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 4a00c673 Tue Aug 22 13:17:29 MDT 2023 Khadija Kamran <kamrankhadijadj@gmail.com> lsm: constify 'file' parameter in security_bprm_creds_from_file()

The 'bprm_creds_from_file' hook has implementation registered in
commoncap. Looking at the function implementation we observe that the
'file' parameter is not changing.

Mark the 'file' parameter of LSM hook security_bprm_creds_from_file() as
'const' since it will not be changing in the LSM hook.

Signed-off-by: Khadija Kamran <kamrankhadijadj@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
diff 03adc61e Thu Oct 12 15:55:18 MDT 2023 Dan Clash <daclash@linux.microsoft.com> audit,io_uring: io_uring openat triggers audit reference count underflow

An io_uring openat operation can update an audit reference count
from multiple threads resulting in the call trace below.

A call to io_uring_submit() with a single openat op with a flag of
IOSQE_ASYNC results in the following reference count updates.

These first part of the system call performs two increments that do not race.

do_syscall_64()
__do_sys_io_uring_enter()
io_submit_sqes()
io_openat_prep()
__io_openat_prep()
getname()
getname_flags() /* update 1 (increment) */
__audit_getname() /* update 2 (increment) */

The openat op is queued to an io_uring worker thread which starts the
opportunity for a race. The system call exit performs one decrement.

do_syscall_64()
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
__audit_syscall_exit()
audit_reset_context()
putname() /* update 3 (decrement) */

The io_uring worker thread performs one increment and two decrements.
These updates can race with the system call decrement.

io_wqe_worker()
io_worker_handle_work()
io_wq_submit_work()
io_issue_sqe()
io_openat()
io_openat2()
do_filp_open()
path_openat()
__audit_inode() /* update 4 (increment) */
putname() /* update 5 (decrement) */
__audit_uring_exit()
audit_reset_context()
putname() /* update 6 (decrement) */

The fix is to change the refcnt member of struct audit_names
from int to atomic_t.

kernel BUG at fs/namei.c:262!
Call Trace:
...
? putname+0x68/0x70
audit_reset_context.part.0.constprop.0+0xe1/0x300
__audit_uring_exit+0xda/0x1c0
io_issue_sqe+0x1f3/0x450
? lock_timer_base+0x3b/0xd0
io_wq_submit_work+0x8d/0x2b0
? __try_to_del_timer_sync+0x67/0xa0
io_worker_handle_work+0x17c/0x2b0
io_wqe_worker+0x10a/0x350

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/
Fixes: 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Dan Clash <daclash@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 2c18a63b Fri Aug 18 08:00:51 MDT 2023 Christian Brauner <brauner@kernel.org> super: wait until we passed kill super

Recent rework moved block device closing out of sb->put_super() and into
sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and
blkdev_put() can end up taking s_umount again.

That means we need to move the removal of the superblock from @fs_supers
out of generic_shutdown_super() and into deactivate_locked_super() to
ensure that concurrent mounters don't fail to open block devices that
are still in use because blkdev_put() in sb->kill_sb() hasn't been
called yet.

We can now do this as we can make iterators through @fs_super and
@super_blocks wait without holding s_umount. Concurrent mounts will wait
until a dying superblock is fully dead so until sb->kill_sb() has been
called and SB_DEAD been set. Concurrent iterators can already discard
any SB_DYING superblock.

Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230818-vfs-super-fixes-v3-v3-4-9f0b1876e46b@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff ed0360bb Thu Aug 17 08:13:33 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: create kiocb_{start,end}_write() helpers

aio, io_uring, cachefiles and overlayfs, all open code an ugly variant
of file_{start,end}_write() to silence lockdep warnings.

Create helpers for this lockdep dance so we can use the helpers in all
the callers.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Message-Id: <20230817141337.1025891-4-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

Completed in 994 milliseconds

12345