Searched +hist:0 +hist:b4d4147 (Results 1 - 25 of 38) sorted by last modified time

12

/linux-master/net/ipv6/
H A Daddrconf.cdiff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 7633c4da Mon Apr 08 08:18:21 MDT 2024 Jiri Benc <jbenc@redhat.com> ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
(in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[ 41.506330] refcount_t: addition on 0; use-after-free.
[ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[ 41.507413] Modules linked in: veth bridge stp llc
[ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.516799] Call Trace:
[ 41.517037] <TASK>
[ 41.517249] ? __warn+0x7b/0x120
[ 41.517535] ? refcount_warn_saturate+0xa5/0x130
[ 41.517923] ? report_bug+0x164/0x190
[ 41.518240] ? handle_bug+0x3d/0x70
[ 41.518541] ? exc_invalid_op+0x17/0x70
[ 41.520972] ? asm_exc_invalid_op+0x1a/0x20
[ 41.521325] ? refcount_warn_saturate+0xa5/0x130
[ 41.521708] ipv6_get_ifaddr+0xda/0xe0
[ 41.522035] inet6_rtm_getaddr+0x342/0x3f0
[ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10
[ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0
[ 41.523102] ? netlink_unicast+0x30f/0x390
[ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 41.523832] netlink_rcv_skb+0x53/0x100
[ 41.524157] netlink_unicast+0x23b/0x390
[ 41.524484] netlink_sendmsg+0x1f2/0x440
[ 41.524826] __sys_sendto+0x1d8/0x1f0
[ 41.525145] __x64_sys_sendto+0x1f/0x30
[ 41.525467] do_syscall_64+0xa5/0x1b0
[ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a
[ 41.526213] RIP: 0033:0x7fbc4cfcea9a
[ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[ 41.531573] </TASK>

Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
/linux-master/drivers/char/
H A Dhpet.cdiff cf8e8658 Thu Oct 20 07:54:33 MDT 2022 Ard Biesheuvel <ardb@kernel.org> arch: Remove Itanium (IA-64) architecture

The Itanium architecture is obsolete, and an informal survey [0] reveals
that any residual use of Itanium hardware in production is mostly HP-UX
or OpenVMS based. The use of Linux on Itanium appears to be limited to
enthusiasts that occasionally boot a fresh Linux kernel to see whether
things are still working as intended, and perhaps to churn out some
distro packages that are rarely used in practice.

None of the original companies behind Itanium still produce or support
any hardware or software for the architecture, and it is listed as
'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers
that contributed on behalf of those companies (nor anyone else, for that
matter) have been willing to support or maintain the architecture
upstream or even be responsible for applying the odd fix. The Intel
firmware team removed all IA-64 support from the Tianocore/EDK2
reference implementation of EFI in 2018. (Itanium is the original
architecture for which EFI was developed, and the way Linux supports it
deviates significantly from other architectures.) Some distros, such as
Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have
dropped support years ago.

While the argument is being made [1] that there is a 'for the common
good' angle to being able to build and run existing projects such as the
Grid Community Toolkit [2] on Itanium for interoperability testing, the
fact remains that none of those projects are known to be deployed on
Linux/ia64, and very few people actually have access to such a system in
the first place. Even if there were ways imaginable in which Linux/ia64
could be put to good use today, what matters is whether anyone is
actually doing that, and this does not appear to be the case.

There are no emulators widely available, and so boot testing Itanium is
generally infeasible for ordinary contributors. GCC still supports IA-64
but its compile farm [3] no longer has any IA-64 machines. GLIBC would
like to get rid of IA-64 [4] too because it would permit some overdue
code cleanups. In summary, the benefits to the ecosystem of having IA-64
be part of it are mostly theoretical, whereas the maintenance overhead
of keeping it supported is real.

So let's rip off the band aid, and remove the IA-64 arch code entirely.
This follows the timeline proposed by the Debian/ia64 maintainer [5],
which removes support in a controlled manner, leaving IA-64 in a known
good state in the most recent LTS release. Other projects will follow
once the kernel support is removed.

[0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/
[1] https://lore.kernel.org/all/0075883c-7c51-00f5-2c2d-5119c1820410@web.de/
[2] https://gridcf.org/gct-docs/latest/index.html
[3] https://cfarm.tetaneutral.net/machines/list/
[4] https://lore.kernel.org/all/87bkiilpc4.fsf@mid.deneb.enyo.de/
[5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/

Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
diff cf8e8658 Thu Oct 20 07:54:33 MDT 2022 Ard Biesheuvel <ardb@kernel.org> arch: Remove Itanium (IA-64) architecture

The Itanium architecture is obsolete, and an informal survey [0] reveals
that any residual use of Itanium hardware in production is mostly HP-UX
or OpenVMS based. The use of Linux on Itanium appears to be limited to
enthusiasts that occasionally boot a fresh Linux kernel to see whether
things are still working as intended, and perhaps to churn out some
distro packages that are rarely used in practice.

None of the original companies behind Itanium still produce or support
any hardware or software for the architecture, and it is listed as
'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers
that contributed on behalf of those companies (nor anyone else, for that
matter) have been willing to support or maintain the architecture
upstream or even be responsible for applying the odd fix. The Intel
firmware team removed all IA-64 support from the Tianocore/EDK2
reference implementation of EFI in 2018. (Itanium is the original
architecture for which EFI was developed, and the way Linux supports it
deviates significantly from other architectures.) Some distros, such as
Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have
dropped support years ago.

While the argument is being made [1] that there is a 'for the common
good' angle to being able to build and run existing projects such as the
Grid Community Toolkit [2] on Itanium for interoperability testing, the
fact remains that none of those projects are known to be deployed on
Linux/ia64, and very few people actually have access to such a system in
the first place. Even if there were ways imaginable in which Linux/ia64
could be put to good use today, what matters is whether anyone is
actually doing that, and this does not appear to be the case.

There are no emulators widely available, and so boot testing Itanium is
generally infeasible for ordinary contributors. GCC still supports IA-64
but its compile farm [3] no longer has any IA-64 machines. GLIBC would
like to get rid of IA-64 [4] too because it would permit some overdue
code cleanups. In summary, the benefits to the ecosystem of having IA-64
be part of it are mostly theoretical, whereas the maintenance overhead
of keeping it supported is real.

So let's rip off the band aid, and remove the IA-64 arch code entirely.
This follows the timeline proposed by the Debian/ia64 maintainer [5],
which removes support in a controlled manner, leaving IA-64 in a known
good state in the most recent LTS release. Other projects will follow
once the kernel support is removed.

[0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/
[1] https://lore.kernel.org/all/0075883c-7c51-00f5-2c2d-5119c1820410@web.de/
[2] https://gridcf.org/gct-docs/latest/index.html
[3] https://cfarm.tetaneutral.net/machines/list/
[4] https://lore.kernel.org/all/87bkiilpc4.fsf@mid.deneb.enyo.de/
[5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/

Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
diff c8dd5541 Fri Jan 21 23:11:34 MST 2022 Luis Chamberlain <mcgrof@kernel.org> hpet: simplify subdirectory registration with register_sysctl()

Patch series "sysctl: second set of kernel/sysctl cleanups", v2.

This is the 2nd set of kernel/sysctl.c cleanups. The diff stat should
reflect how this is a much better way to deal with theses. Fortunately
coccinelle can be used to ensure correctness for most of these and/or
future merge conflicts.

Note that since this is part of a larger effort to cleanup
kernel/sysctl.c I think we have no other option but to go with merging
these patches in either Andrew's tree or keep them staged in a separate
tree and send a merge request later. Otherwise kernel/sysctl.c will end
up becoming a sore spot for the next merge window.

This patch (of 8):

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci drivers/char/hpet.c

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL

Link: https://lkml.kernel.org/r/20211123202422.819032-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20211123202422.819032-2-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c8dd5541 Fri Jan 21 23:11:34 MST 2022 Luis Chamberlain <mcgrof@kernel.org> hpet: simplify subdirectory registration with register_sysctl()

Patch series "sysctl: second set of kernel/sysctl cleanups", v2.

This is the 2nd set of kernel/sysctl.c cleanups. The diff stat should
reflect how this is a much better way to deal with theses. Fortunately
coccinelle can be used to ensure correctness for most of these and/or
future merge conflicts.

Note that since this is part of a larger effort to cleanup
kernel/sysctl.c I think we have no other option but to go with merging
these patches in either Andrew's tree or keep them staged in a separate
tree and send a merge request later. Otherwise kernel/sysctl.c will end
up becoming a sore spot for the next merge window.

This patch (of 8):

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci drivers/char/hpet.c

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL

Link: https://lkml.kernel.org/r/20211123202422.819032-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20211123202422.819032-2-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c8dd5541 Fri Jan 21 23:11:34 MST 2022 Luis Chamberlain <mcgrof@kernel.org> hpet: simplify subdirectory registration with register_sysctl()

Patch series "sysctl: second set of kernel/sysctl cleanups", v2.

This is the 2nd set of kernel/sysctl.c cleanups. The diff stat should
reflect how this is a much better way to deal with theses. Fortunately
coccinelle can be used to ensure correctness for most of these and/or
future merge conflicts.

Note that since this is part of a larger effort to cleanup
kernel/sysctl.c I think we have no other option but to go with merging
these patches in either Andrew's tree or keep them staged in a separate
tree and send a merge request later. Otherwise kernel/sysctl.c will end
up becoming a sore spot for the next merge window.

This patch (of 8):

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci drivers/char/hpet.c

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL

Link: https://lkml.kernel.org/r/20211123202422.819032-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20211123202422.819032-2-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c8dd5541 Fri Jan 21 23:11:34 MST 2022 Luis Chamberlain <mcgrof@kernel.org> hpet: simplify subdirectory registration with register_sysctl()

Patch series "sysctl: second set of kernel/sysctl cleanups", v2.

This is the 2nd set of kernel/sysctl.c cleanups. The diff stat should
reflect how this is a much better way to deal with theses. Fortunately
coccinelle can be used to ensure correctness for most of these and/or
future merge conflicts.

Note that since this is part of a larger effort to cleanup
kernel/sysctl.c I think we have no other option but to go with merging
these patches in either Andrew's tree or keep them staged in a separate
tree and send a merge request later. Otherwise kernel/sysctl.c will end
up becoming a sore spot for the next merge window.

This patch (of 8):

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci drivers/char/hpet.c

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL

Link: https://lkml.kernel.org/r/20211123202422.819032-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20211123202422.819032-2-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 98c49f17 Tue Jan 28 19:26:13 MST 2020 Gustavo A. R. Silva <gustavo@embeddedor.com> char: hpet: Fix out-of-bounds read bug

Currently, there is an out-of-bounds read on array hpetp->hp_dev
in the following for loop:

870 for (i = 0; i < hdp->hd_nirqs; i++)
871 hpetp->hp_dev[i].hd_hdwirq = hdp->hd_irq[i];

This is due to the recent change from one-element array to
flexible-array member in struct hpets:

104 struct hpets {
...
113 struct hpet_dev hp_dev[];
114 };

This change affected the total size of the dynamic memory
allocation, decreasing it by one time the size of struct hpet_dev.

Fix this by adjusting the allocation size when calling
struct_size().

Fixes: 987f028b8637c ("char: hpet: Use flexible-array member")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/r/20200129022613.GA24281@embeddedor.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 987f028b Mon Jan 20 16:53:26 MST 2020 Gustavo A. R. Silva <gustavo@embeddedor.com> char: hpet: Use flexible-array member

Old code in the kernel uses 1-byte and 0-byte arrays to indicate the
presence of a "variable length array":

struct something {
int length;
u8 data[1];
};

struct something *instance;

instance = kmalloc(sizeof(*instance) + size, GFP_KERNEL);
instance->length = size;
memcpy(instance->data, source, size);

There is also 0-byte arrays. Both cases pose confusion for things like
sizeof(), CONFIG_FORTIFY_SOURCE, etc.[1] Instead, the preferred mechanism
to declare variable-length types such as the one above is a flexible array
member[2] which need to be the last member of a structure and empty-sized:

struct something {
int stuff;
u8 data[];
};

Also, by making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
unadvertenly introduced[3] to the codebase from now on.

[1] https://github.com/KSPP/linux/issues/21
[2] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200120235326.GA29231@embeddedor.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 987f028b Mon Jan 20 16:53:26 MST 2020 Gustavo A. R. Silva <gustavo@embeddedor.com> char: hpet: Use flexible-array member

Old code in the kernel uses 1-byte and 0-byte arrays to indicate the
presence of a "variable length array":

struct something {
int length;
u8 data[1];
};

struct something *instance;

instance = kmalloc(sizeof(*instance) + size, GFP_KERNEL);
instance->length = size;
memcpy(instance->data, source, size);

There is also 0-byte arrays. Both cases pose confusion for things like
sizeof(), CONFIG_FORTIFY_SOURCE, etc.[1] Instead, the preferred mechanism
to declare variable-length types such as the one above is a flexible array
member[2] which need to be the last member of a structure and empty-sized:

struct something {
int stuff;
u8 data[];
};

Also, by making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
unadvertenly introduced[3] to the codebase from now on.

[1] https://github.com/KSPP/linux/issues/21
[2] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200120235326.GA29231@embeddedor.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0c7d37f4 Thu Jul 11 07:27:57 MDT 2019 Kefeng Wang <wangkefeng.wang@huawei.com> hpet: Fix division by zero in hpet_time_div()

The base value in do_div() called by hpet_time_div() is truncated from
unsigned long to uint32_t, resulting in a divide-by-zero exception.

UBSAN: Undefined behaviour in ../drivers/char/hpet.c:572:2
division by zero
CPU: 1 PID: 23682 Comm: syz-executor.3 Not tainted 4.4.184.x86_64+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
0000000000000000 b573382df1853d00 ffff8800a3287b98 ffffffff81ad7561
ffff8800a3287c00 ffffffff838b35b0 ffffffff838b3860 ffff8800a3287c20
0000000000000000 ffff8800a3287bb0 ffffffff81b8f25e ffffffff838b35a0
Call Trace:
[<ffffffff81ad7561>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ad7561>] dump_stack+0xc1/0x120 lib/dump_stack.c:51
[<ffffffff81b8f25e>] ubsan_epilogue+0x12/0x8d lib/ubsan.c:166
[<ffffffff81b900cb>] __ubsan_handle_divrem_overflow+0x282/0x2c8 lib/ubsan.c:262
[<ffffffff823560dd>] hpet_time_div drivers/char/hpet.c:572 [inline]
[<ffffffff823560dd>] hpet_ioctl_common drivers/char/hpet.c:663 [inline]
[<ffffffff823560dd>] hpet_ioctl_common.cold+0xa8/0xad drivers/char/hpet.c:577
[<ffffffff81e63d56>] hpet_ioctl+0xc6/0x180 drivers/char/hpet.c:676
[<ffffffff81711590>] vfs_ioctl fs/ioctl.c:43 [inline]
[<ffffffff81711590>] file_ioctl fs/ioctl.c:470 [inline]
[<ffffffff81711590>] do_vfs_ioctl+0x6e0/0xf70 fs/ioctl.c:605
[<ffffffff81711eb4>] SYSC_ioctl fs/ioctl.c:622 [inline]
[<ffffffff81711eb4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:613
[<ffffffff82846003>] tracesys_phase2+0x90/0x95

The main C reproducer autogenerated by syzkaller,

syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
memcpy((void*)0x20000100, "/dev/hpet\000", 10);
syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
syscall(__NR_ioctl, r[0], 0x40086806, 0x40000000000000);

Fix it by using div64_ul().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zhang HongJun <zhanghongjun2@huawei.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190711132757.130092-1-wangkefeng.wang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
/linux-master/kernel/
H A Dsysctl.cdiff 19f0423f Fri Feb 23 01:31:26 MST 2024 Huang Yiwei <quic_hyiwei@quicinc.com> tracing: Support to dump instance traces by ftrace_dump_on_oops

Currently ftrace only dumps the global trace buffer on an OOPs. For
debugging a production usecase, instance trace will be helpful to
check specific problems since global trace buffer may be used for
other purposes.

This patch extend the ftrace_dump_on_oops parameter to dump a specific
or multiple trace instances:

- ftrace_dump_on_oops=0: as before -- don't dump
- ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer
on all CPUs
- ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global
trace buffer on CPU that triggered the oops
- ftrace_dump_on_oops=<instance_name>: new behavior -- dump the
tracing instance matching <instance_name>
- ftrace_dump_on_oops[=2/orig_cpu],<instance1_name>[=2/orig_cpu],
<instrance2_name>[=2/orig_cpu]: new behavior -- dump the global trace
buffer and multiple instance buffer on all CPUs, or only dump on CPU
that triggered the oops if =2 or =orig_cpu is given

Also, the sysctl node can handle the input accordingly.

Link: https://lore.kernel.org/linux-trace-kernel/20240223083126.1817731-1-quic_hyiwei@quicinc.com

Cc: Ross Zwisler <zwisler@google.com>
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <mcgrof@kernel.org>
Cc: <keescook@chromium.org>
Cc: <j.granados@samsung.com>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <corbet@lwn.net>
Signed-off-by: Huang Yiwei <quic_hyiwei@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
diff cf8e8658 Thu Oct 20 07:54:33 MDT 2022 Ard Biesheuvel <ardb@kernel.org> arch: Remove Itanium (IA-64) architecture

The Itanium architecture is obsolete, and an informal survey [0] reveals
that any residual use of Itanium hardware in production is mostly HP-UX
or OpenVMS based. The use of Linux on Itanium appears to be limited to
enthusiasts that occasionally boot a fresh Linux kernel to see whether
things are still working as intended, and perhaps to churn out some
distro packages that are rarely used in practice.

None of the original companies behind Itanium still produce or support
any hardware or software for the architecture, and it is listed as
'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers
that contributed on behalf of those companies (nor anyone else, for that
matter) have been willing to support or maintain the architecture
upstream or even be responsible for applying the odd fix. The Intel
firmware team removed all IA-64 support from the Tianocore/EDK2
reference implementation of EFI in 2018. (Itanium is the original
architecture for which EFI was developed, and the way Linux supports it
deviates significantly from other architectures.) Some distros, such as
Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have
dropped support years ago.

While the argument is being made [1] that there is a 'for the common
good' angle to being able to build and run existing projects such as the
Grid Community Toolkit [2] on Itanium for interoperability testing, the
fact remains that none of those projects are known to be deployed on
Linux/ia64, and very few people actually have access to such a system in
the first place. Even if there were ways imaginable in which Linux/ia64
could be put to good use today, what matters is whether anyone is
actually doing that, and this does not appear to be the case.

There are no emulators widely available, and so boot testing Itanium is
generally infeasible for ordinary contributors. GCC still supports IA-64
but its compile farm [3] no longer has any IA-64 machines. GLIBC would
like to get rid of IA-64 [4] too because it would permit some overdue
code cleanups. In summary, the benefits to the ecosystem of having IA-64
be part of it are mostly theoretical, whereas the maintenance overhead
of keeping it supported is real.

So let's rip off the band aid, and remove the IA-64 arch code entirely.
This follows the timeline proposed by the Debian/ia64 maintainer [5],
which removes support in a controlled manner, leaving IA-64 in a known
good state in the most recent LTS release. Other projects will follow
once the kernel support is removed.

[0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/
[1] https://lore.kernel.org/all/0075883c-7c51-00f5-2c2d-5119c1820410@web.de/
[2] https://gridcf.org/gct-docs/latest/index.html
[3] https://cfarm.tetaneutral.net/machines/list/
[4] https://lore.kernel.org/all/87bkiilpc4.fsf@mid.deneb.enyo.de/
[5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/

Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
diff cf8e8658 Thu Oct 20 07:54:33 MDT 2022 Ard Biesheuvel <ardb@kernel.org> arch: Remove Itanium (IA-64) architecture

The Itanium architecture is obsolete, and an informal survey [0] reveals
that any residual use of Itanium hardware in production is mostly HP-UX
or OpenVMS based. The use of Linux on Itanium appears to be limited to
enthusiasts that occasionally boot a fresh Linux kernel to see whether
things are still working as intended, and perhaps to churn out some
distro packages that are rarely used in practice.

None of the original companies behind Itanium still produce or support
any hardware or software for the architecture, and it is listed as
'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers
that contributed on behalf of those companies (nor anyone else, for that
matter) have been willing to support or maintain the architecture
upstream or even be responsible for applying the odd fix. The Intel
firmware team removed all IA-64 support from the Tianocore/EDK2
reference implementation of EFI in 2018. (Itanium is the original
architecture for which EFI was developed, and the way Linux supports it
deviates significantly from other architectures.) Some distros, such as
Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have
dropped support years ago.

While the argument is being made [1] that there is a 'for the common
good' angle to being able to build and run existing projects such as the
Grid Community Toolkit [2] on Itanium for interoperability testing, the
fact remains that none of those projects are known to be deployed on
Linux/ia64, and very few people actually have access to such a system in
the first place. Even if there were ways imaginable in which Linux/ia64
could be put to good use today, what matters is whether anyone is
actually doing that, and this does not appear to be the case.

There are no emulators widely available, and so boot testing Itanium is
generally infeasible for ordinary contributors. GCC still supports IA-64
but its compile farm [3] no longer has any IA-64 machines. GLIBC would
like to get rid of IA-64 [4] too because it would permit some overdue
code cleanups. In summary, the benefits to the ecosystem of having IA-64
be part of it are mostly theoretical, whereas the maintenance overhead
of keeping it supported is real.

So let's rip off the band aid, and remove the IA-64 arch code entirely.
This follows the timeline proposed by the Debian/ia64 maintainer [5],
which removes support in a controlled manner, leaving IA-64 in a known
good state in the most recent LTS release. Other projects will follow
once the kernel support is removed.

[0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/
[1] https://lore.kernel.org/all/0075883c-7c51-00f5-2c2d-5119c1820410@web.de/
[2] https://gridcf.org/gct-docs/latest/index.html
[3] https://cfarm.tetaneutral.net/machines/list/
[4] https://lore.kernel.org/all/87bkiilpc4.fsf@mid.deneb.enyo.de/
[5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/

Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
diff 28898e26 Sun May 28 14:54:20 MDT 2023 Luis Chamberlain <mcgrof@kernel.org> sysctl: move security keys sysctl registration to its own file

The security keys sysctls are already declared on its own file,
just move the sysctl registration to its own file to help avoid
merge conflicts on sysctls.c, and help with clearing up sysctl.c
further.

This creates a small penalty of 23 bytes:

./scripts/bloat-o-meter vmlinux.1 vmlinux.2
add/remove: 2/0 grow/shrink: 0/1 up/down: 49/-26 (23)
Function old new delta
init_security_keys_sysctls - 33 +33
__pfx_init_security_keys_sysctls - 16 +16
sysctl_init_bases 85 59 -26
Total: Before=21256937, After=21256960, chg +0.00%

But soon we'll be saving tons of bytes anyway, as we modify the
sysctl registrations to use ARRAY_SIZE and so we get rid of all the
empty array elements so let's just clean this up now.

Reviewed-by: Paul Moore <paul@paul-moore.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 28898e26 Sun May 28 14:54:20 MDT 2023 Luis Chamberlain <mcgrof@kernel.org> sysctl: move security keys sysctl registration to its own file

The security keys sysctls are already declared on its own file,
just move the sysctl registration to its own file to help avoid
merge conflicts on sysctls.c, and help with clearing up sysctl.c
further.

This creates a small penalty of 23 bytes:

./scripts/bloat-o-meter vmlinux.1 vmlinux.2
add/remove: 2/0 grow/shrink: 0/1 up/down: 49/-26 (23)
Function old new delta
init_security_keys_sysctls - 33 +33
__pfx_init_security_keys_sysctls - 16 +16
sysctl_init_bases 85 59 -26
Total: Before=21256937, After=21256960, chg +0.00%

But soon we'll be saving tons of bytes anyway, as we modify the
sysctl registrations to use ARRAY_SIZE and so we get rid of all the
empty array elements so let's just clean this up now.

Reviewed-by: Paul Moore <paul@paul-moore.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 861dc0b4 Sun May 28 14:43:46 MDT 2023 Luis Chamberlain <mcgrof@kernel.org> sysctl: move umh sysctl registration to its own file

Move the umh sysctl registration to its own file, the array is
already there. We do this to remove the clutter out of kernel/sysctl.c
to avoid merge conflicts.

This also lets the sysctls not be built at all now when CONFIG_SYSCTL
is not enabled.

This has a small penalty of 23 bytes but soon we'll be removing
all the empty entries on sysctl arrays so just do this cleanup
now:

./scripts/bloat-o-meter vmlinux.base vmlinux.1
add/remove: 2/0 grow/shrink: 0/1 up/down: 49/-26 (23)
Function old new delta
init_umh_sysctls - 33 +33
__pfx_init_umh_sysctls - 16 +16
sysctl_init_bases 111 85 -26
Total: Before=21256914, After=21256937, chg +0.00%

Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 861dc0b4 Sun May 28 14:43:46 MDT 2023 Luis Chamberlain <mcgrof@kernel.org> sysctl: move umh sysctl registration to its own file

Move the umh sysctl registration to its own file, the array is
already there. We do this to remove the clutter out of kernel/sysctl.c
to avoid merge conflicts.

This also lets the sysctls not be built at all now when CONFIG_SYSCTL
is not enabled.

This has a small penalty of 23 bytes but soon we'll be removing
all the empty entries on sysctl arrays so just do this cleanup
now:

./scripts/bloat-o-meter vmlinux.base vmlinux.1
add/remove: 2/0 grow/shrink: 0/1 up/down: 49/-26 (23)
Function old new delta
init_umh_sysctls - 33 +33
__pfx_init_umh_sysctls - 16 +16
sysctl_init_bases 111 85 -26
Total: Before=21256914, After=21256937, chg +0.00%

Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 01e6aac7 Thu May 18 14:37:41 MDT 2023 Luis Chamberlain <mcgrof@kernel.org> signal: move show_unhandled_signals sysctl to its own file

The show_unhandled_signals sysctl is the only sysctl for debug
left on kernel/sysctl.c. We've been moving the syctls out from
kernel/sysctl.c so to help avoid merge conflicts as the shared
array gets out of hand.

This change incurs simplifies sysctl registration by localizing
it where it should go for a penalty in size of increasing the
kernel by 23 bytes, we accept this given recent cleanups have
actually already saved us 1465 bytes in the prior commits.

./scripts/bloat-o-meter vmlinux.3-remove-dev-table vmlinux.4-remove-debug-table
add/remove: 3/1 grow/shrink: 0/1 up/down: 177/-154 (23)
Function old new delta
signal_debug_table - 128 +128
init_signal_sysctls - 33 +33
__pfx_init_signal_sysctls - 16 +16
sysctl_init_bases 85 59 -26
debug_table 128 - -128
Total: Before=21256967, After=21256990, chg +0.00%

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 996ef312 Thu May 18 14:40:15 MDT 2023 Luis Chamberlain <mcgrof@kernel.org> sysctl: remove empty dev table

Now that all the dev sysctls have been moved out we can remove the
dev sysctl base directory. We don't need to create base directories,
they are created for you as if using 'mkdir -p' with register_syctl()
and register_sysctl_init(). For details refer to sysctl_mkdir_p()
usage.

We save 90 bytes with this changes:

./scripts/bloat-o-meter vmlinux.2.remove-sysctl-table vmlinux.3-remove-dev-table
add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-90 (-90)
Function old new delta
sysctl_init_bases 111 85 -26
dev_table 64 - -64
Total: Before=21257057, After=21256967, chg -0.00%

The empty dev table has been in place since the v2.5.0 days because
back then ordering was essentialy. But later commit 7ec66d06362d
("sysctl: Stop requiring explicit management of sysctl directories"),
merged as of v3.4-rc1, the entire ordering of directories was replaced
by allowing sysctl directory autogeneration. This new mechanism
introduced on v3.4 allows for sysctl directories to automatically be
created for sysctl tables when they are needed and automatically removes
them when no sysctl tables use them. That commit also added a dedicated
struct ctl_dir as a new type for these autogenerated directories.

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 996ef312 Thu May 18 14:40:15 MDT 2023 Luis Chamberlain <mcgrof@kernel.org> sysctl: remove empty dev table

Now that all the dev sysctls have been moved out we can remove the
dev sysctl base directory. We don't need to create base directories,
they are created for you as if using 'mkdir -p' with register_syctl()
and register_sysctl_init(). For details refer to sysctl_mkdir_p()
usage.

We save 90 bytes with this changes:

./scripts/bloat-o-meter vmlinux.2.remove-sysctl-table vmlinux.3-remove-dev-table
add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-90 (-90)
Function old new delta
sysctl_init_bases 111 85 -26
dev_table 64 - -64
Total: Before=21257057, After=21256967, chg -0.00%

The empty dev table has been in place since the v2.5.0 days because
back then ordering was essentialy. But later commit 7ec66d06362d
("sysctl: Stop requiring explicit management of sysctl directories"),
merged as of v3.4-rc1, the entire ordering of directories was replaced
by allowing sysctl directory autogeneration. This new mechanism
introduced on v3.4 allows for sysctl directories to automatically be
created for sysctl tables when they are needed and automatically removes
them when no sysctl tables use them. That commit also added a dedicated
struct ctl_dir as a new type for these autogenerated directories.

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 996ef312 Thu May 18 14:40:15 MDT 2023 Luis Chamberlain <mcgrof@kernel.org> sysctl: remove empty dev table

Now that all the dev sysctls have been moved out we can remove the
dev sysctl base directory. We don't need to create base directories,
they are created for you as if using 'mkdir -p' with register_syctl()
and register_sysctl_init(). For details refer to sysctl_mkdir_p()
usage.

We save 90 bytes with this changes:

./scripts/bloat-o-meter vmlinux.2.remove-sysctl-table vmlinux.3-remove-dev-table
add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-90 (-90)
Function old new delta
sysctl_init_bases 111 85 -26
dev_table 64 - -64
Total: Before=21257057, After=21256967, chg -0.00%

The empty dev table has been in place since the v2.5.0 days because
back then ordering was essentialy. But later commit 7ec66d06362d
("sysctl: Stop requiring explicit management of sysctl directories"),
merged as of v3.4-rc1, the entire ordering of directories was replaced
by allowing sysctl directory autogeneration. This new mechanism
introduced on v3.4 allows for sysctl directories to automatically be
created for sysctl tables when they are needed and automatically removes
them when no sysctl tables use them. That commit also added a dedicated
struct ctl_dir as a new type for these autogenerated directories.

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
/linux-master/ipc/
H A Dipc_sysctl.cdiff 50ec499b Mon Jan 15 08:46:41 MST 2024 Alexey Gladkov <legion@kernel.org> sysctl: allow change system v ipc sysctls inside ipc namespace

Patch series "Allow to change ipc/mq sysctls inside ipc namespace", v3.

Right now ipc and mq limits count as per ipc namespace, but only real root
can change them. By default, the current values of these limits are such
that it can only be reduced. Since only root can change the values, it is
impossible to reduce these limits in the rootless container.

We can allow limit changes within ipc namespace because mq parameters are
limited by RLIMIT_MSGQUEUE and ipc parameters are not limited to anything
other than cgroups.


This patch (of 3):

Rootless containers are not allowed to modify kernel IPC parameters.

All default limits are set to such high values that in fact there are no
limits at all. All limits are not inherited and are initialized to
default values when a new ipc_namespace is created.

For new ipc_namespace:

size_t ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24))
size_t ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24))
int ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15)
int ipc_ns.shm_rmid_forced = 0;
unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192
unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000
unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384

The shm_tot (total amount of shared pages) has also ceased to be global,
it is located in ipc_namespace and is not inherited from anywhere.

In such conditions, it cannot be said that these limits limit anything.
The real limiter for them is cgroups.

If we allow rootless containers to change these parameters, then it can
only be reduced.

Link: https://lkml.kernel.org/r/cover.1705333426.git.legion@kernel.org
Link: https://lkml.kernel.org/r/d2f4603305cbfed58a24755aa61d027314b73a45.1705333426.git.legion@kernel.org
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Link: https://lkml.kernel.org/r/e2d84d3ec0172cfff759e6065da84ce0cc2736f8.1663756794.git.legion@kernel.org
Cc: Christian Brauner <brauner@kernel.org>
Cc: Joel Granados <joel.granados@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 0e9beb8a Mon Nov 08 19:36:02 MST 2021 Manfred Spraul <manfred@colorfullife.com> ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL

Compilation of ipc/ipc_sysctl.c is controlled by
obj-$(CONFIG_SYSVIPC_SYSCTL)
[see ipc/Makefile]

And CONFIG_SYSVIPC_SYSCTL depends on SYSCTL
[see init/Kconfig]

An SYSCTL is selected by PROC_SYSCTL.
[see fs/proc/Kconfig]

Thus: #ifndef CONFIG_PROC_SYSCTL in ipc/ipc_sysctl.c is impossible, the
fallback can be removed.

Link: https://lkml.kernel.org/r/20210918145337.3369-1-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff eec4844f Thu Jul 18 16:58:50 MDT 2019 Matteo Croce <mcroce@redhat.com> proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff eec4844f Thu Jul 18 16:58:50 MDT 2019 Matteo Croce <mcroce@redhat.com> proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff eec4844f Thu Jul 18 16:58:50 MDT 2019 Matteo Croce <mcroce@redhat.com> proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 99db46ea Tue May 14 16:46:36 MDT 2019 Manfred Spraul <manfred@colorfullife.com> ipc: do cyclic id allocation for the ipc object.

For ipcmni_extend mode, the sequence number space is only 7 bits. So
the chance of id reuse is relatively high compared with the non-extended
mode.

To alleviate this id reuse problem, this patch enables cyclic allocation
for the index to the radix tree (idx). The disadvantage is that this
can cause a slight slow-down of the fast path, as the radix tree could
be higher than necessary.

To limit the radix tree height, I have chosen the following limits:
1) The cycling is done over in_use*1.5.
2) At least, the cycling is done over
"normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements
"ipcmni_extended": 4096 elements

Result:
- for normal mode:
No change for <= 42 active ipc elements. With more than 42
active ipc elements, a 2nd level would be added to the radix
tree.
Without cyclic allocation, a 2nd level would be added only with
more than 63 active elements.

- for extended mode:
Cycling creates always at least a 2-level radix tree.
With more than 2730 active objects, a 3rd level would be
added, instead of > 4095 active objects until the 3rd level
is added without cyclic allocation.

For a 2-level radix tree compared to a 1-level radix tree, I have
observed < 1% performance impact.

Notes:
1) Normal "x=semget();y=semget();" is unaffected: Then the idx
is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
is used.

2) The -1% happens in a microbenchmark after this situation:
x=semget();
for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
y=semget();
Now perform semget calls on x and y that do not sleep.

3) The worst-case reuse cycle time is unfortunately unaffected:
If you have 2^24-1 ipc objects allocated, and get/remove the last
possible element in a loop, then the id is reused after 128
get/remove pairs.

Performance check:
A microbenchmark that performes no-op semop() randomly on two IDs,
with only these two IDs allocated.
The IDs were set using /proc/sys/kernel/sem_next_id.
The test was run 5 times, averages are shown.

1 & 2: Base (6.22 seconds for 10.000.000 semops)
1 & 40: -0.2%
1 & 3348: - 0.8%
1 & 27348: - 1.6%
1 & 15777204: - 3.2%

Or: ~12.6 cpu cycles per additional radix tree level.
The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower
than what I remember (spectre impact?).

V2 of the patch:
- use "min" and "max"
- use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
(2<<12).

[akpm@linux-foundation.org: fix max() warning]
Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 99db46ea Tue May 14 16:46:36 MDT 2019 Manfred Spraul <manfred@colorfullife.com> ipc: do cyclic id allocation for the ipc object.

For ipcmni_extend mode, the sequence number space is only 7 bits. So
the chance of id reuse is relatively high compared with the non-extended
mode.

To alleviate this id reuse problem, this patch enables cyclic allocation
for the index to the radix tree (idx). The disadvantage is that this
can cause a slight slow-down of the fast path, as the radix tree could
be higher than necessary.

To limit the radix tree height, I have chosen the following limits:
1) The cycling is done over in_use*1.5.
2) At least, the cycling is done over
"normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements
"ipcmni_extended": 4096 elements

Result:
- for normal mode:
No change for <= 42 active ipc elements. With more than 42
active ipc elements, a 2nd level would be added to the radix
tree.
Without cyclic allocation, a 2nd level would be added only with
more than 63 active elements.

- for extended mode:
Cycling creates always at least a 2-level radix tree.
With more than 2730 active objects, a 3rd level would be
added, instead of > 4095 active objects until the 3rd level
is added without cyclic allocation.

For a 2-level radix tree compared to a 1-level radix tree, I have
observed < 1% performance impact.

Notes:
1) Normal "x=semget();y=semget();" is unaffected: Then the idx
is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
is used.

2) The -1% happens in a microbenchmark after this situation:
x=semget();
for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
y=semget();
Now perform semget calls on x and y that do not sleep.

3) The worst-case reuse cycle time is unfortunately unaffected:
If you have 2^24-1 ipc objects allocated, and get/remove the last
possible element in a loop, then the id is reused after 128
get/remove pairs.

Performance check:
A microbenchmark that performes no-op semop() randomly on two IDs,
with only these two IDs allocated.
The IDs were set using /proc/sys/kernel/sem_next_id.
The test was run 5 times, averages are shown.

1 & 2: Base (6.22 seconds for 10.000.000 semops)
1 & 40: -0.2%
1 & 3348: - 0.8%
1 & 27348: - 1.6%
1 & 15777204: - 3.2%

Or: ~12.6 cpu cycles per additional radix tree level.
The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower
than what I remember (spectre impact?).

V2 of the patch:
- use "min" and "max"
- use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
(2<<12).

[akpm@linux-foundation.org: fix max() warning]
Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 8c81ddd2 Tue Oct 30 16:07:24 MDT 2018 Waiman Long <longman@redhat.com> ipc: IPCMNI limit check for semmni

For SysV semaphores, the semmni value is the last part of the 4-element
sem number array. To make semmni behave in a similar way to msgmni and
shmmni, we can't directly use the _minmax handler. Instead, a special sem
specific handler is added to check the last argument to make sure that it
is limited to the [0, IPCMNI] range. An error will be returned if this is
not the case.

Link: http://lkml.kernel.org/r/1536352137-12003-3-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 1195d94e Mon Oct 13 16:54:10 MDT 2014 Andrey Vagin <avagin@openvz.org> ipc: always handle a new value of auto_msgmni

proc_dointvec_minmax() returns zero if a new value has been set. So we
don't need to check all charecters have been handled.

Below you can find two examples. In the new value has not been handled
properly.

$ strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n\0", 3) = 2
close(3) = 0
exit_group(0)
$ cat /sys/kernel/debug/tracing/trace

$strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n", 2) = 2
close(3) = 0

$ cat /sys/kernel/debug/tracing/trace
a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax

Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Joe Perches <joe@perches.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 1195d94e Mon Oct 13 16:54:10 MDT 2014 Andrey Vagin <avagin@openvz.org> ipc: always handle a new value of auto_msgmni

proc_dointvec_minmax() returns zero if a new value has been set. So we
don't need to check all charecters have been handled.

Below you can find two examples. In the new value has not been handled
properly.

$ strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n\0", 3) = 2
close(3) = 0
exit_group(0)
$ cat /sys/kernel/debug/tracing/trace

$strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n", 2) = 2
close(3) = 0

$ cat /sys/kernel/debug/tracing/trace
a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax

Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Joe Perches <joe@perches.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 1195d94e Mon Oct 13 16:54:10 MDT 2014 Andrey Vagin <avagin@openvz.org> ipc: always handle a new value of auto_msgmni

proc_dointvec_minmax() returns zero if a new value has been set. So we
don't need to check all charecters have been handled.

Below you can find two examples. In the new value has not been handled
properly.

$ strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n\0", 3) = 2
close(3) = 0
exit_group(0)
$ cat /sys/kernel/debug/tracing/trace

$strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n", 2) = 2
close(3) = 0

$ cat /sys/kernel/debug/tracing/trace
a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax

Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Joe Perches <joe@perches.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 1195d94e Mon Oct 13 16:54:10 MDT 2014 Andrey Vagin <avagin@openvz.org> ipc: always handle a new value of auto_msgmni

proc_dointvec_minmax() returns zero if a new value has been set. So we
don't need to check all charecters have been handled.

Below you can find two examples. In the new value has not been handled
properly.

$ strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n\0", 3) = 2
close(3) = 0
exit_group(0)
$ cat /sys/kernel/debug/tracing/trace

$strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n", 2) = 2
close(3) = 0

$ cat /sys/kernel/debug/tracing/trace
a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax

Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Joe Perches <joe@perches.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 1195d94e Mon Oct 13 16:54:10 MDT 2014 Andrey Vagin <avagin@openvz.org> ipc: always handle a new value of auto_msgmni

proc_dointvec_minmax() returns zero if a new value has been set. So we
don't need to check all charecters have been handled.

Below you can find two examples. In the new value has not been handled
properly.

$ strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n\0", 3) = 2
close(3) = 0
exit_group(0)
$ cat /sys/kernel/debug/tracing/trace

$strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n", 2) = 2
close(3) = 0

$ cat /sys/kernel/debug/tracing/trace
a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax

Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Joe Perches <joe@perches.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 1195d94e Mon Oct 13 16:54:10 MDT 2014 Andrey Vagin <avagin@openvz.org> ipc: always handle a new value of auto_msgmni

proc_dointvec_minmax() returns zero if a new value has been set. So we
don't need to check all charecters have been handled.

Below you can find two examples. In the new value has not been handled
properly.

$ strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n\0", 3) = 2
close(3) = 0
exit_group(0)
$ cat /sys/kernel/debug/tracing/trace

$strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n", 2) = 2
close(3) = 0

$ cat /sys/kernel/debug/tracing/trace
a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax

Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Joe Perches <joe@perches.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
H A Dmqueue.cdiff da27f796 Fri Jan 27 11:46:51 MST 2023 Rik van Riel <riel@surriel.com> ipc,namespace: batch free ipc_namespace structures

Instead of waiting for an RCU grace period between each ipc_namespace
structure that is being freed, wait an RCU grace period for every batch
of ipc_namespace structures.

Thanks to Al Viro for the suggestion of the helper function.

This speeds up the run time of the test case that allocates ipc_namespaces
in a loop from 6 minutes, to a little over 1 second:

real 0m1.192s
user 0m0.038s
sys 0m1.152s

Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Chris Mason <clm@meta.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff da27f796 Fri Jan 27 11:46:51 MST 2023 Rik van Riel <riel@surriel.com> ipc,namespace: batch free ipc_namespace structures

Instead of waiting for an RCU grace period between each ipc_namespace
structure that is being freed, wait an RCU grace period for every batch
of ipc_namespace structures.

Thanks to Al Viro for the suggestion of the helper function.

This speeds up the run time of the test case that allocates ipc_namespaces
in a loop from 6 minutes, to a little over 1 second:

real 0m1.192s
user 0m0.038s
sys 0m1.152s

Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Chris Mason <clm@meta.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff da27f796 Fri Jan 27 11:46:51 MST 2023 Rik van Riel <riel@surriel.com> ipc,namespace: batch free ipc_namespace structures

Instead of waiting for an RCU grace period between each ipc_namespace
structure that is being freed, wait an RCU grace period for every batch
of ipc_namespace structures.

Thanks to Al Viro for the suggestion of the helper function.

This speeds up the run time of the test case that allocates ipc_namespaces
in a loop from 6 minutes, to a little over 1 second:

real 0m1.192s
user 0m0.038s
sys 0m1.152s

Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Chris Mason <clm@meta.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/net/sunrpc/
H A Dxprtsock.cdiff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 26e8bfa3 Tue Sep 26 15:03:22 MDT 2023 Anna Schumaker <Anna.Schumaker@Netapp.com> SUNRPC/TLS: Lock the lower_xprt during the tls handshake

Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[ 224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 224.150951] #PF: supervisor read access in kernel mode
[ 224.151117] #PF: error_code(0x0000) - not-present page
[ 224.151278] PGD 0 P4D 0
[ 224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[ 224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[ 224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[ 224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[ 224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[ 224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[ 224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[ 224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[ 224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[ 224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[ 224.154521] FS: 0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[ 224.154763] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[ 224.155157] PKRU: 55555554
[ 224.155244] Call Trace:
[ 224.155325] <TASK>
[ 224.155395] ? __die_body+0x68/0xb0
[ 224.155507] ? page_fault_oops+0x34c/0x3a0
[ 224.155635] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 224.155793] ? exc_page_fault+0x7a/0x1b0
[ 224.155916] ? asm_exc_page_fault+0x26/0x30
[ 224.156047] ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156367] ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.156697] ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[ 224.157013] process_scheduled_works+0x24e/0x450
[ 224.157158] worker_thread+0x21c/0x2d0
[ 224.157275] ? __pfx_worker_thread+0x10/0x10
[ 224.157409] kthread+0xe8/0x110
[ 224.157510] ? __pfx_kthread+0x10/0x10
[ 224.157628] ret_from_fork+0x37/0x50
[ 224.157741] ? __pfx_kthread+0x10/0x10
[ 224.157859] ret_from_fork_asm+0x1b/0x30
[ 224.157983] </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
/linux-master/net/rxrpc/
H A Dsysctl.cdiff 5d7edbc9 Sat Aug 27 07:27:56 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Get rid of the Rx ring

Get rid of the Rx ring and replace it with a pair of queues instead. One
queue gets the packets that are in-sequence and are ready for processing by
recvmsg(); the other queue gets the out-of-sequence packets for addition to
the first queue as the holes get filled.

The annotation ring is removed and replaced with a SACK table. The SACK
table has the bits set that correspond exactly to the sequence number of
the packet being acked. The SACK ring is copied when an ACK packet is
being assembled and rotated so that the first ACK is in byte 0.

Flow control handling is altered so that packets that are moved to the
in-sequence queue are hard-ACK'd even before they're consumed - and then
the Rx window size in the ACK packet (rsize) is shrunk down to compensate
(even going to 0 if the window is full).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
diff 5d7edbc9 Sat Aug 27 07:27:56 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Get rid of the Rx ring

Get rid of the Rx ring and replace it with a pair of queues instead. One
queue gets the packets that are in-sequence and are ready for processing by
recvmsg(); the other queue gets the out-of-sequence packets for addition to
the first queue as the holes get filled.

The annotation ring is removed and replaced with a SACK table. The SACK
table has the bits set that correspond exactly to the sequence number of
the packet being acked. The SACK ring is copied when an ACK packet is
being assembled and rotated so that the first ACK is in byte 0.

Flow control handling is altered so that packets that are moved to the
in-sequence queue are hard-ACK'd even before they're consumed - and then
the Rx window size in the ACK packet (rsize) is shrunk down to compensate
(even going to 0 if the window is full).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 88e22159 Sat May 21 02:03:04 MDT 2022 David Howells <dhowells@redhat.com> rxrpc: Fix listen() setting the bar too high for the prealloc rings

AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.

BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/net/ipv4/
H A Ddevinet.cdiff cdb2f80f Thu Feb 29 04:40:16 MST 2024 Eric Dumazet <edumazet@google.com> inet: use xa_array iterator to implement inet_dump_ifaddr()

1) inet_dump_ifaddr() can can run under RCU protection
instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
an extra recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 16748707 Tue Feb 27 02:24:11 MST 2024 Eric Dumazet <edumazet@google.com> inet: use xa_array iterator to implement inet_netconf_dump_devconf()

1) inet_netconf_dump_devconf() can run under RCU protection
instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
an extra recvmsg() system call.

3) Do not use inet_base_seq() anymore, for_each_netdev_dump()
has nice properties. Restarting a GETDEVCONF dump if a device has
been added/removed or if net->ipv4.dev_addr_genid has changed is moot.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff ac28b1ec Wed Sep 06 20:57:09 MDT 2023 Liu Jian <liujian56@huawei.com> net: ipv4: fix one memleak in __inet_del_ifa()

I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2

It can be repoduced via:

ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond

In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.

Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'.

Fixes: 0ff60a45678e ("[IPV4]: Fix secondary IP addresses after promotion")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/fs/lockd/
H A Dsvc.cdiff f4578ba1 Tue Oct 10 16:31:22 MDT 2023 NeilBrown <neilb@suse.de> lockd: hold a reference to nlmsvc_serv while stopping the thread.

Both nfsd and nfsv4-callback take a temporary reference to the svc_serv
while calling svc_set_num_threads() to stop the last thread. lockd does
not.

This extra reference prevents the scv_serv from being freed when the
last thread drops its reference count. This is not currently needed
for lockd as the svc_serv is not accessed after the last thread is told
to exit.

However a future patch will require svc_exit_thread() to access the
svc_serv after the svc_put() so it will need the code that calls
svc_set_num_threads() to keep a reference and keep the svc_serv active.

So copy the pattern from nfsd and nfsv4-cb to lockd, and take a
reference around svc_set_num_threads(.., 0)

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff 7b719e2b Tue Jul 18 00:38:08 MDT 2023 NeilBrown <neilb@suse.de> SUNRPC: change svc_recv() to return void.

svc_recv() currently returns a 0 on success or one of two errors:
- -EAGAIN means no message was successfully received
- -EINTR means the thread has been told to stop

Previously nfsd would stop as the result of a signal as well as
following kthread_stop(). In that case the difference was useful: EINTR
means stop unconditionally. EAGAIN means stop if kthread_should_stop(),
continue otherwise.

Now threads only exit when kthread_should_stop() so we don't need the
distinction.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff fc412a61 Tue May 02 12:07:13 MDT 2023 Tom Rix <trix@redhat.com> lockd: define nlm_port_min,max with CONFIG_SYSCTL

gcc with W=1 and ! CONFIG_SYSCTL
fs/lockd/svc.c:80:51: error: ‘nlm_port_max’ defined but not used [-Werror=unused-const-variable=]
80 | static const int nlm_port_min = 0, nlm_port_max = 65535;
| ^~~~~~~~~~~~
fs/lockd/svc.c:80:33: error: ‘nlm_port_min’ defined but not used [-Werror=unused-const-variable=]
80 | static const int nlm_port_min = 0, nlm_port_max = 65535;
| ^~~~~~~~~~~~

The only use of these variables is when CONFIG_SYSCTL
is defined, so their definition should be likewise conditional.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff fc412a61 Tue May 02 12:07:13 MDT 2023 Tom Rix <trix@redhat.com> lockd: define nlm_port_min,max with CONFIG_SYSCTL

gcc with W=1 and ! CONFIG_SYSCTL
fs/lockd/svc.c:80:51: error: ‘nlm_port_max’ defined but not used [-Werror=unused-const-variable=]
80 | static const int nlm_port_min = 0, nlm_port_max = 65535;
| ^~~~~~~~~~~~
fs/lockd/svc.c:80:33: error: ‘nlm_port_min’ defined but not used [-Werror=unused-const-variable=]
80 | static const int nlm_port_min = 0, nlm_port_max = 65535;
| ^~~~~~~~~~~~

The only use of these variables is when CONFIG_SYSCTL
is defined, so their definition should be likewise conditional.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff 6b044fba Sun Nov 28 21:51:25 MST 2021 NeilBrown <neilb@suse.de> lockd: use svc_set_num_threads() for thread start and stop

svc_set_num_threads() does everything that lockd_start_svc() does, except
set sv_maxconn. It also (when passed 0) finds the threads and
stops them with kthread_stop().

So move the setting for sv_maxconn, and use svc_set_num_thread()

We now don't need nlmsvc_task.

Now that we use svc_set_num_threads() it makes sense to set svo_module.
This request that the thread exists with module_put_and_exit().
Also fix the documentation for svo_module to make this explicit.

svc_prepare_thread is now only used where it is defined, so it can be
made static.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff e919b076 Tue Nov 07 22:55:55 MST 2017 Vasily Averin <vvs@virtuozzo.com> lockd: remove net pointer from messages

Publishing of net pointer is not safe,
use net->ns.inum as net ID in debug messages

[ 171.757678] lockd_up_net: per-net data created; net=f00001e7
[ 171.767188] NFSD: starting 90-second grace period (net f00001e7)
[ 300.653313] lockd: nuking all hosts in net f00001e7...
[ 300.653641] lockd: host garbage collection for net f00001e7
[ 300.653968] lockd: nlmsvc_mark_resources for net f00001e7
[ 300.711483] lockd_down_net: per-net data destroyed; net=f00001e7
[ 300.711847] lockd: nuking all hosts in net 0...
[ 300.711847] lockd: host garbage collection for net 0
[ 300.711848] lockd: nlmsvc_mark_resources for net 0

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff e919b076 Tue Nov 07 22:55:55 MST 2017 Vasily Averin <vvs@virtuozzo.com> lockd: remove net pointer from messages

Publishing of net pointer is not safe,
use net->ns.inum as net ID in debug messages

[ 171.757678] lockd_up_net: per-net data created; net=f00001e7
[ 171.767188] NFSD: starting 90-second grace period (net f00001e7)
[ 300.653313] lockd: nuking all hosts in net f00001e7...
[ 300.653641] lockd: host garbage collection for net f00001e7
[ 300.653968] lockd: nlmsvc_mark_resources for net f00001e7
[ 300.711483] lockd_down_net: per-net data destroyed; net=f00001e7
[ 300.711847] lockd: nuking all hosts in net 0...
[ 300.711847] lockd: host garbage collection for net 0
[ 300.711848] lockd: nlmsvc_mark_resources for net 0

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff e919b076 Tue Nov 07 22:55:55 MST 2017 Vasily Averin <vvs@virtuozzo.com> lockd: remove net pointer from messages

Publishing of net pointer is not safe,
use net->ns.inum as net ID in debug messages

[ 171.757678] lockd_up_net: per-net data created; net=f00001e7
[ 171.767188] NFSD: starting 90-second grace period (net f00001e7)
[ 300.653313] lockd: nuking all hosts in net f00001e7...
[ 300.653641] lockd: host garbage collection for net f00001e7
[ 300.653968] lockd: nlmsvc_mark_resources for net f00001e7
[ 300.711483] lockd_down_net: per-net data destroyed; net=f00001e7
[ 300.711847] lockd: nuking all hosts in net 0...
[ 300.711847] lockd: host garbage collection for net 0
[ 300.711848] lockd: nlmsvc_mark_resources for net 0

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff c7d03a00 Wed Nov 16 18:58:21 MST 2016 Alexey Dobriyan <adobriyan@gmail.com> netns: make struct pernet_operations::id unsigned int

Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

void f(long *p, int i)
{
g(p[i]);
}

roughly translates to

movsx rsi, esi
mov rdi, [rsi+...]
call g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff c7d03a00 Wed Nov 16 18:58:21 MST 2016 Alexey Dobriyan <adobriyan@gmail.com> netns: make struct pernet_operations::id unsigned int

Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

void f(long *p, int i)
{
g(p[i]);
}

roughly translates to

movsx rsi, esi
mov rdi, [rsi+...]
call g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff c7d03a00 Wed Nov 16 18:58:21 MST 2016 Alexey Dobriyan <adobriyan@gmail.com> netns: make struct pernet_operations::id unsigned int

Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

void f(long *p, int i)
{
g(p[i]);
}

roughly translates to

movsx rsi, esi
mov rdi, [rsi+...]
call g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff c7d03a00 Wed Nov 16 18:58:21 MST 2016 Alexey Dobriyan <adobriyan@gmail.com> netns: make struct pernet_operations::id unsigned int

Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

void f(long *p, int i)
{
g(p[i]);
}

roughly translates to

movsx rsi, esi
mov rdi, [rsi+...]
call g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff c7d03a00 Wed Nov 16 18:58:21 MST 2016 Alexey Dobriyan <adobriyan@gmail.com> netns: make struct pernet_operations::id unsigned int

Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

void f(long *p, int i)
{
g(p[i]);
}

roughly translates to

movsx rsi, esi
mov rdi, [rsi+...]
call g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/drivers/md/
H A Dmd.cdiff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
diff 41425f96 Tue Mar 05 00:23:05 MST 2024 Yu Kuai <yukuai3@huawei.com> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
/linux-master/include/linux/
H A Dsysctl.hdiff 0b68ab50 Sat Dec 23 06:53:47 MST 2023 Thomas Weißschuh <linux@weissschuh.net> sysctl: delete unused define SYSCTL_PERM_EMPTY_DIR

It seems it was never used.

Fixes: 2f2665c13af4 ("sysctl: replace child with an enumeration")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 8001f493 Fri Oct 27 15:46:40 MDT 2023 Krister Johansen <kjlx@templeofstupid.com> proc: sysctl: prevent aliased sysctls from getting passed to init

The code that checks for unknown boot options is unaware of the sysctl
alias facility, which maps bootparams to sysctl values. If a user sets
an old value that has a valid alias, a message about an invalid
parameter will be printed during boot, and the parameter will get passed
to init. Fix by checking for the existence of aliased parameters in the
unknown boot parameter code. If an alias exists, don't return an error
or pass the value to init.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: stable@vger.kernel.org
Fixes: 0a477e1ae21b ("kernel/sysctl: support handling command line aliases")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 2f2665c1 Fri Jun 16 02:59:22 MDT 2023 Joel Granados <j.granados@samsung.com> sysctl: replace child with an enumeration

This is part of the effort to remove the empty element at the end of
ctl_table structs. "child" was a deprecated elem in this struct and was
being used to differentiate between two types of ctl_tables: "normal"
and "permanently emtpy".

What changed?:
* Replace "child" with an enumeration that will have two values: the
default (0) and the permanently empty (1). The latter is left at zero
so when struct ctl_table is created with kzalloc or in a local
context, it will have the zero value by default. We document the
new enum with kdoc.
* Remove the "empty child" check from sysctl_check_table
* Remove count_subheaders function as there is no longer a need to
calculate how many headers there are for every child
* Remove the recursive call to unregister_sysctl_table as there is no
need to traverse down the child tree any longer
* Add a new SYSCTL_PERM_EMPTY_DIR binary flag
* Remove the last remanence of child from partport/procfs.c

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 2f5edd03 Tue May 23 06:22:19 MDT 2023 Joel Granados <j.granados@samsung.com> sysctl: Refactor base paths registrations

This is part of the general push to deprecate register_sysctl_paths and
register_sysctl_table. The old way of doing this through
register_sysctl_base and DECLARE_SYSCTL_BASE macro is replaced with a
call to register_sysctl_init. The 5 base paths affected are: "kernel",
"vm", "debug", "dev" and "fs".

We remove the register_sysctl_base function and the DECLARE_SYSCTL_BASE
macro since they are no longer needed.

In order to quickly acertain that the paths did not actually change I
executed `find /proc/sys/ | sha1sum` and made sure that the sha was the
same before and after the commit.

We end up saving 563 bytes with this change:

./scripts/bloat-o-meter vmlinux.0.base vmlinux.1.refactor-base-paths
add/remove: 0/5 grow/shrink: 2/0 up/down: 77/-640 (-563)
Function old new delta
sysctl_init_bases 55 111 +56
init_fs_sysctls 12 33 +21
vm_base_table 128 - -128
kernel_base_table 128 - -128
fs_base_table 128 - -128
dev_base_table 128 - -128
debug_base_table 128 - -128
Total: Before=21258215, After=21257652, chg -0.00%

[mcgrof: modified to use register_sysctl_init() over register_sysctl()
and add bloat-o-meter stats]

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Christian Brauner <brauner@kernel.org>
diff 2f5edd03 Tue May 23 06:22:19 MDT 2023 Joel Granados <j.granados@samsung.com> sysctl: Refactor base paths registrations

This is part of the general push to deprecate register_sysctl_paths and
register_sysctl_table. The old way of doing this through
register_sysctl_base and DECLARE_SYSCTL_BASE macro is replaced with a
call to register_sysctl_init. The 5 base paths affected are: "kernel",
"vm", "debug", "dev" and "fs".

We remove the register_sysctl_base function and the DECLARE_SYSCTL_BASE
macro since they are no longer needed.

In order to quickly acertain that the paths did not actually change I
executed `find /proc/sys/ | sha1sum` and made sure that the sha was the
same before and after the commit.

We end up saving 563 bytes with this change:

./scripts/bloat-o-meter vmlinux.0.base vmlinux.1.refactor-base-paths
add/remove: 0/5 grow/shrink: 2/0 up/down: 77/-640 (-563)
Function old new delta
sysctl_init_bases 55 111 +56
init_fs_sysctls 12 33 +21
vm_base_table 128 - -128
kernel_base_table 128 - -128
fs_base_table 128 - -128
dev_base_table 128 - -128
debug_base_table 128 - -128
Total: Before=21258215, After=21257652, chg -0.00%

[mcgrof: modified to use register_sysctl_init() over register_sysctl()
and add bloat-o-meter stats]

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Christian Brauner <brauner@kernel.org>
diff 2f5edd03 Tue May 23 06:22:19 MDT 2023 Joel Granados <j.granados@samsung.com> sysctl: Refactor base paths registrations

This is part of the general push to deprecate register_sysctl_paths and
register_sysctl_table. The old way of doing this through
register_sysctl_base and DECLARE_SYSCTL_BASE macro is replaced with a
call to register_sysctl_init. The 5 base paths affected are: "kernel",
"vm", "debug", "dev" and "fs".

We remove the register_sysctl_base function and the DECLARE_SYSCTL_BASE
macro since they are no longer needed.

In order to quickly acertain that the paths did not actually change I
executed `find /proc/sys/ | sha1sum` and made sure that the sha was the
same before and after the commit.

We end up saving 563 bytes with this change:

./scripts/bloat-o-meter vmlinux.0.base vmlinux.1.refactor-base-paths
add/remove: 0/5 grow/shrink: 2/0 up/down: 77/-640 (-563)
Function old new delta
sysctl_init_bases 55 111 +56
init_fs_sysctls 12 33 +21
vm_base_table 128 - -128
kernel_base_table 128 - -128
fs_base_table 128 - -128
dev_base_table 128 - -128
debug_base_table 128 - -128
Total: Before=21258215, After=21257652, chg -0.00%

[mcgrof: modified to use register_sysctl_init() over register_sysctl()
and add bloat-o-meter stats]

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Christian Brauner <brauner@kernel.org>
diff 19c4e618 Tue May 23 06:22:18 MDT 2023 Joel Granados <j.granados@samsung.com> sysctl: stop exporting register_sysctl_table

We make register_sysctl_table static because the only function calling
it is in fs/proc/proc_sysctl.c (__register_sysctl_base). We remove it
from the sysctl.h header and modify the documentation in both the header
and proc_sysctl.c files to mention "register_sysctl" instead of
"register_sysctl_table".

This plus the commits that remove register_sysctl_table from parport
save 217 bytes:

./scripts/bloat-o-meter .bsysctl/vmlinux.old .bsysctl/vmlinux.new
add/remove: 0/1 grow/shrink: 5/1 up/down: 458/-675 (-217)
Function old new delta
__register_sysctl_base 8 286 +278
parport_proc_register 268 379 +111
parport_device_proc_register 195 247 +52
kzalloc.constprop 598 608 +10
parport_default_proc_register 62 69 +7
register_sysctl_table 291 - -291
parport_sysctl_template 1288 904 -384
Total: Before=8603076, After=8602859, chg -0.00%

Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 3ddd9a80 Fri Jan 21 23:10:50 MST 2022 Xiaoming Ni <nixiaoming@huawei.com> sysctl: add a new register_sysctl_init() interface

Patch series "sysctl: first set of kernel/sysctl cleanups", v2.

Finally had time to respin the series of the work we had started last
year on cleaning up the kernel/sysct.c kitchen sink. People keeps
stuffing their sysctls in that file and this creates a maintenance
burden. So this effort is aimed at placing sysctls where they actually
belong.

I'm going to split patches up into series as there is quite a bit of
work.

This first set adds register_sysctl_init() for uses of registerting a
sysctl on the init path, adds const where missing to a few places,
generalizes common values so to be more easy to share, and starts the
move of a few kernel/sysctl.c out where they belong.

The majority of rework on v2 in this first patch set is 0-day fixes.
Eric Biederman's feedback is later addressed in subsequent patch sets.

I'll only post the first two patch sets for now. We can address the
rest once the first two patch sets get completely reviewed / Acked.

This patch (of 9):

The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

Today though folks heavily rely on tables on kernel/sysctl.c so they can
easily just extend this table with their needed sysctls. In order to
help users move their sysctls out we need to provide a helper which can
be used during code initialization.

We special-case the initialization use of register_sysctl() since it
*is* safe to fail, given all that sysctls do is provide a dynamic
interface to query or modify at runtime an existing variable. So the
use case of register_sysctl() on init should *not* stop if the sysctls
don't end up getting registered. It would be counter productive to stop
boot if a simple sysctl registration failed.

Provide a helper for init then, and document the recommended init levels
to use for callers of this routine. We will later use this in
subsequent patches to start slimming down kernel/sysctl.c tables and
moving sysctl registration to the code which actually needs these
sysctls.

[mcgrof@kernel.org: major commit log and documentation rephrasing also moved to fs/proc/proc_sysctl.c ]

Link: https://lkml.kernel.org/r/20211123202347.818157-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20211123202347.818157-2-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 3db978d4 Sun Jun 07 22:40:24 MDT 2020 Vlastimil Babka <vbabka@suse.cz> kernel/sysctl: support setting sysctl parameters from kernel command line

Patch series "support setting sysctl parameters from kernel command line", v3.

This series adds support for something that seems like many people
always wanted but nobody added it yet, so here's the ability to set
sysctl parameters via kernel command line options in the form of
sysctl.vm.something=1

The important part is Patch 1. The second, not so important part is an
attempt to clean up legacy one-off parameters that do the same thing as
a sysctl. I don't want to remove them completely for compatibility
reasons, but with generic sysctl support the idea is to remove the
one-off param handlers and treat the parameters as aliases for the
sysctl variants.

I have identified several parameters that mention sysctl counterparts in
Documentation/admin-guide/kernel-parameters.txt but there might be more.
The conversion also has varying level of success:

- numa_zonelist_order is converted in Patch 2 together with adding the
necessary infrastructure. It's easy as it doesn't really do anything
but warn on deprecated value these days.

- hung_task_panic is converted in Patch 3, but there's a downside that
now it only accepts 0 and 1, while previously it was any integer
value

- nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
so there's no straighforward conversion possible

- traceoff_on_warning is a flag without value and it would be required
to handle that somehow in the conversion infractructure, which seems
pointless for a single flag

This patch (of 5):

A recently proposed patch to add vm_swappiness command line parameter in
addition to existing sysctl [1] made me wonder why we don't have a
general support for passing sysctl parameters via command line.

Googling found only somebody else wondering the same [2], but I haven't
found any prior discussion with reasons why not to do this.

Settings the vm_swappiness issue aside (the underlying issue might be
solved in a different way), quick search of kernel-parameters.txt shows
there are already some that exist as both sysctl and kernel parameter -
hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.

A general mechanism would remove the need to add more of those one-offs
and might be handy in situations where configuration by e.g.
/etc/sysctl.d/ is impractical.

Hence, this patch adds a new parse_args() pass that looks for parameters
prefixed by 'sysctl.' and tries to interpret them as writes to the
corresponding sys/ files using an temporary in-kernel procfs mount.
This mechanism was suggested by Eric W. Biederman [3], as it handles
all dynamically registered sysctl tables, even though we don't handle
modular sysctls. Errors due to e.g. invalid parameter name or value
are reported in the kernel log.

The processing is hooked right before the init process is loaded, as
some handlers might be more complicated than simple setters and might
need some subsystems to be initialized. At the moment the init process
can be started and eventually execute a process writing to /proc/sys/
then it should be also fine to do that from the kernel.

Sysctls registered later on module load time are not set by this
mechanism - it's expected that in such scenarios, setting sysctl values
from userspace is practical enough.

[1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
[2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
[3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff eec4844f Thu Jul 18 16:58:50 MDT 2019 Matteo Croce <mcroce@redhat.com> proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff eec4844f Thu Jul 18 16:58:50 MDT 2019 Matteo Croce <mcroce@redhat.com> proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff eec4844f Thu Jul 18 16:58:50 MDT 2019 Matteo Croce <mcroce@redhat.com> proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/fs/nfs/
H A Dsysctl.cdiff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/fs/coda/
H A Dsysctl.cdiff c03e3126 Fri Mar 25 02:57:01 MDT 2011 Rakib Mullick <rakib.mullick@gmail.com> codafs: fix build break when CONFIG_PROC_SYSCTL=n

Commit 0bc825d240ab ("codafs: fix compile warning when CONFIG_SYSCTL=n")
introduces build breakage, when CONFIG_PROC_SYSCTL=n and
CONFIG_CODA_FS=y:

fs/built-in.o: In function `init_coda':
psdev.c:(.init.text+0xc02): undefined reference to `coda_sysctl_init'
psdev.c:(.init.text+0xc7c): undefined reference to `coda_sysctl_clean'
fs/built-in.o: In function `exit_coda':
psdev.c:(.exit.text+0xa9): undefined reference to `coda_sysctl_clean'
make: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c03e3126 Fri Mar 25 02:57:01 MDT 2011 Rakib Mullick <rakib.mullick@gmail.com> codafs: fix build break when CONFIG_PROC_SYSCTL=n

Commit 0bc825d240ab ("codafs: fix compile warning when CONFIG_SYSCTL=n")
introduces build breakage, when CONFIG_PROC_SYSCTL=n and
CONFIG_CODA_FS=y:

fs/built-in.o: In function `init_coda':
psdev.c:(.init.text+0xc02): undefined reference to `coda_sysctl_init'
psdev.c:(.init.text+0xc7c): undefined reference to `coda_sysctl_clean'
fs/built-in.o: In function `exit_coda':
psdev.c:(.exit.text+0xa9): undefined reference to `coda_sysctl_clean'
make: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c03e3126 Fri Mar 25 02:57:01 MDT 2011 Rakib Mullick <rakib.mullick@gmail.com> codafs: fix build break when CONFIG_PROC_SYSCTL=n

Commit 0bc825d240ab ("codafs: fix compile warning when CONFIG_SYSCTL=n")
introduces build breakage, when CONFIG_PROC_SYSCTL=n and
CONFIG_CODA_FS=y:

fs/built-in.o: In function `init_coda':
psdev.c:(.init.text+0xc02): undefined reference to `coda_sysctl_init'
psdev.c:(.init.text+0xc7c): undefined reference to `coda_sysctl_clean'
fs/built-in.o: In function `exit_coda':
psdev.c:(.exit.text+0xa9): undefined reference to `coda_sysctl_clean'
make: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c03e3126 Fri Mar 25 02:57:01 MDT 2011 Rakib Mullick <rakib.mullick@gmail.com> codafs: fix build break when CONFIG_PROC_SYSCTL=n

Commit 0bc825d240ab ("codafs: fix compile warning when CONFIG_SYSCTL=n")
introduces build breakage, when CONFIG_PROC_SYSCTL=n and
CONFIG_CODA_FS=y:

fs/built-in.o: In function `init_coda':
psdev.c:(.init.text+0xc02): undefined reference to `coda_sysctl_init'
psdev.c:(.init.text+0xc7c): undefined reference to `coda_sysctl_clean'
fs/built-in.o: In function `exit_coda':
psdev.c:(.exit.text+0xa9): undefined reference to `coda_sysctl_clean'
make: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0bc825d2 Tue Mar 22 17:35:00 MDT 2011 Rakib Mullick <rakib.mullick@gmail.com> codafs: fix compile warning when CONFIG_SYSCTL=n

When CONFIG_SYSCTL=n, we get the following warning:

fs/coda/sysctl.c:18: warning: `coda_tabl' defined but not used

Fix the warning by making sure coda_table and it's callee function are in
the same context. Also clean up the code by removing extra #ifdef.

[akpm@linux-foundation.org: remove unneeded stub macros]
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/net/core/
H A Dneighbour.cdiff 62e395f8 Sun Jan 29 22:14:28 MST 2023 Brian Haley <haleyb.dev@gmail.com> neighbor: fix proxy_delay usage when it is zero

When set to zero, the neighbor sysctl proxy_delay value
does not cause an immediate reply for ARP/ND requests
as expected, it instead causes a random delay between
[0, U32_MAX). Looking at this comment from
__get_random_u32_below() explains the reason:

/*
* This function is technically undefined for ceil == 0, and in fact
* for the non-underscored constant version in the header, we build bug
* on that. But for the non-constant case, it's convenient to have that
* evaluate to being a straight call to get_random_u32(), so that
* get_random_u32_inclusive() can work over its whole range without
* undefined behavior.
*/

Added helper function that does not call get_random_u32_below()
if proxy_delay is zero and just uses the current value of
jiffies instead, causing pneigh_enqueue() to respond
immediately.

Also added definition of proxy_delay to ip-sysctl.txt since
it was missing.

Signed-off-by: Brian Haley <haleyb.dev@gmail.com>
Link: https://lore.kernel.org/r/20230130171428.367111-1-haleyb.dev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 62e395f8 Sun Jan 29 22:14:28 MST 2023 Brian Haley <haleyb.dev@gmail.com> neighbor: fix proxy_delay usage when it is zero

When set to zero, the neighbor sysctl proxy_delay value
does not cause an immediate reply for ARP/ND requests
as expected, it instead causes a random delay between
[0, U32_MAX). Looking at this comment from
__get_random_u32_below() explains the reason:

/*
* This function is technically undefined for ceil == 0, and in fact
* for the non-underscored constant version in the header, we build bug
* on that. But for the non-constant case, it's convenient to have that
* evaluate to being a straight call to get_random_u32(), so that
* get_random_u32_inclusive() can work over its whole range without
* undefined behavior.
*/

Added helper function that does not call get_random_u32_below()
if proxy_delay is zero and just uses the current value of
jiffies instead, causing pneigh_enqueue() to respond
immediately.

Also added definition of proxy_delay to ip-sysctl.txt since
it was missing.

Signed-off-by: Brian Haley <haleyb.dev@gmail.com>
Link: https://lore.kernel.org/r/20230130171428.367111-1-haleyb.dev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff c1d2ecdf Thu Feb 02 08:25:51 MST 2023 Julian Anastasov <ja@ssi.bg> neigh: make sure used and confirmed times are valid

Entries can linger in cache without timer for days, thanks to
the gc_thresh1 limit. As result, without traffic, the confirmed
time can be outdated and to appear to be in the future. Later,
on traffic, NUD_STALE entries can switch to NUD_DELAY and start
the timer which can see the invalid confirmed time and wrongly
switch to NUD_REACHABLE state instead of NUD_PROBE. As result,
timer is set many days in the future. This is more visible on
32-bit platforms, with higher HZ value.

Why this is a problem? While we expect unused entries to expire,
such entries stay in REACHABLE state for too long, locked in
cache. They are not expired normally, only when cache is full.

Problem and the wrong state change reported by Zhang Changzhong:

172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE

350520 seconds have elapsed since this entry was last updated, but it is
still in the REACHABLE state (base_reachable_time_ms is 30000),
preventing lladdr from being updated through probe.

Fix it by ensuring timer is started with valid used/confirmed
times. Considering the valid time range is LONG_MAX jiffies,
we try not to go too much in the past while we are in
DELAY/PROBE state. There are also places that need
used/updated times to be validated while timer is not running.

Reported-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff c1d2ecdf Thu Feb 02 08:25:51 MST 2023 Julian Anastasov <ja@ssi.bg> neigh: make sure used and confirmed times are valid

Entries can linger in cache without timer for days, thanks to
the gc_thresh1 limit. As result, without traffic, the confirmed
time can be outdated and to appear to be in the future. Later,
on traffic, NUD_STALE entries can switch to NUD_DELAY and start
the timer which can see the invalid confirmed time and wrongly
switch to NUD_REACHABLE state instead of NUD_PROBE. As result,
timer is set many days in the future. This is more visible on
32-bit platforms, with higher HZ value.

Why this is a problem? While we expect unused entries to expire,
such entries stay in REACHABLE state for too long, locked in
cache. They are not expired normally, only when cache is full.

Problem and the wrong state change reported by Zhang Changzhong:

172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE

350520 seconds have elapsed since this entry was last updated, but it is
still in the REACHABLE state (base_reachable_time_ms is 30000),
preventing lladdr from being updated through probe.

Fix it by ensuring timer is started with valid used/confirmed
times. Considering the valid time range is LONG_MAX jiffies,
we try not to go too much in the past while we are in
DELAY/PROBE state. There are also places that need
used/updated times to be validated while timer is not running.

Reported-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff c1d2ecdf Thu Feb 02 08:25:51 MST 2023 Julian Anastasov <ja@ssi.bg> neigh: make sure used and confirmed times are valid

Entries can linger in cache without timer for days, thanks to
the gc_thresh1 limit. As result, without traffic, the confirmed
time can be outdated and to appear to be in the future. Later,
on traffic, NUD_STALE entries can switch to NUD_DELAY and start
the timer which can see the invalid confirmed time and wrongly
switch to NUD_REACHABLE state instead of NUD_PROBE. As result,
timer is set many days in the future. This is more visible on
32-bit platforms, with higher HZ value.

Why this is a problem? While we expect unused entries to expire,
such entries stay in REACHABLE state for too long, locked in
cache. They are not expired normally, only when cache is full.

Problem and the wrong state change reported by Zhang Changzhong:

172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE

350520 seconds have elapsed since this entry was last updated, but it is
still in the REACHABLE state (base_reachable_time_ms is 30000),
preventing lladdr from being updated through probe.

Fix it by ensuring timer is started with valid used/confirmed
times. Considering the valid time range is LONG_MAX jiffies,
we try not to go too much in the past while we are in
DELAY/PROBE state. There are also places that need
used/updated times to be validated while timer is not running.

Reported-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 8207f253 Tue Nov 15 15:09:41 MST 2022 Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at> net: neigh: decrement the family specific qlen

Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit
per-device") introduced the length counter qlen in struct neigh_parms.
There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and
while the family specific qlen is incremented in pneigh_enqueue(), the
mentioned commit decrements always the IPv4/ARP specific qlen,
regardless of the currently processed family, in pneigh_queue_purge()
and neigh_proxy_process().

As a result, with IPv6/ND, the family specific qlen is only incremented
(and never decremented) until it exceeds PROXY_QLEN, and then, according
to the check in pneigh_enqueue(), neighbor solicitations are not
answered anymore. As an example, this is noted when using the
subnet-router anycast address to access a Linux router. After a certain
amount of time (in the observed case, qlen exceeded PROXY_QLEN after two
days), the Linux router stops answering neighbor solicitations for its
subnet-router anycast address and effectively becomes unreachable.

Another result with IPv6/ND is that the IPv4/ARP specific qlen is
decremented more often than incremented. This leads to negative qlen
values, as a signed integer has been used for the length counter qlen,
and potentially to an integer overflow.

Fix this by introducing the helper function neigh_parms_qlen_dec(),
which decrements the family specific qlen. Thereby, make use of the
existing helper function neigh_get_dev_parms_rcu(), whose definition
therefore needs to be placed earlier in neighbour.c. Take the family
member from struct neigh_table to determine the currently processed
family and appropriately call neigh_parms_qlen_dec() from
pneigh_queue_purge() and neigh_proxy_process().

Additionally, use an unsigned integer for the length counter qlen.

Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device")
Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 8207f253 Tue Nov 15 15:09:41 MST 2022 Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at> net: neigh: decrement the family specific qlen

Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit
per-device") introduced the length counter qlen in struct neigh_parms.
There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and
while the family specific qlen is incremented in pneigh_enqueue(), the
mentioned commit decrements always the IPv4/ARP specific qlen,
regardless of the currently processed family, in pneigh_queue_purge()
and neigh_proxy_process().

As a result, with IPv6/ND, the family specific qlen is only incremented
(and never decremented) until it exceeds PROXY_QLEN, and then, according
to the check in pneigh_enqueue(), neighbor solicitations are not
answered anymore. As an example, this is noted when using the
subnet-router anycast address to access a Linux router. After a certain
amount of time (in the observed case, qlen exceeded PROXY_QLEN after two
days), the Linux router stops answering neighbor solicitations for its
subnet-router anycast address and effectively becomes unreachable.

Another result with IPv6/ND is that the IPv4/ARP specific qlen is
decremented more often than incremented. This leads to negative qlen
values, as a signed integer has been used for the length counter qlen,
and potentially to an integer overflow.

Fix this by introducing the helper function neigh_parms_qlen_dec(),
which decrements the family specific qlen. Thereby, make use of the
existing helper function neigh_get_dev_parms_rcu(), whose definition
therefore needs to be placed earlier in neighbour.c. Take the family
member from struct neigh_table to determine the currently processed
family and appropriately call neigh_parms_qlen_dec() from
pneigh_queue_purge() and neigh_proxy_process().

Additionally, use an unsigned integer for the length counter qlen.

Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device")
Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f8017317 Tue Nov 01 06:15:52 MDT 2022 Chen Zhongjin <chenzhongjin@huawei.com> net, neigh: Fix null-ptr-deref in neigh_table_clear()

When IPv6 module gets initialized but hits an error in the middle,
kenel panic with:

KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f]
CPU: 1 PID: 361 Comm: insmod
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370
RSP: 0018:ffff888012677908 EFLAGS: 00000202
...
Call Trace:
<TASK>
neigh_table_clear+0x94/0x2d0
ndisc_cleanup+0x27/0x40 [ipv6]
inet6_init+0x21c/0x2cb [ipv6]
do_one_initcall+0xd3/0x4d0
do_init_module+0x1ae/0x670
...
Kernel panic - not syncing: Fatal exception

When ipv6 initialization fails, it will try to cleanup and calls:

neigh_table_clear()
neigh_ifdown(tbl, NULL)
pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL))
# dev_net(NULL) triggers null-ptr-deref.

Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev
is NULL, to make kernel not panic immediately.

Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
/linux-master/drivers/parport/
H A Dprocfs.cdiff 2f2665c1 Fri Jun 16 02:59:22 MDT 2023 Joel Granados <j.granados@samsung.com> sysctl: replace child with an enumeration

This is part of the effort to remove the empty element at the end of
ctl_table structs. "child" was a deprecated elem in this struct and was
being used to differentiate between two types of ctl_tables: "normal"
and "permanently emtpy".

What changed?:
* Replace "child" with an enumeration that will have two values: the
default (0) and the permanently empty (1). The latter is left at zero
so when struct ctl_table is created with kzalloc or in a local
context, it will have the zero value by default. We document the
new enum with kdoc.
* Remove the "empty child" check from sysctl_check_table
* Remove count_subheaders function as there is no longer a need to
calculate how many headers there are for every child
* Remove the recursive call to unregister_sysctl_table as there is no
need to traverse down the child tree any longer
* Add a new SYSCTL_PERM_EMPTY_DIR binary flag
* Remove the last remanence of child from partport/procfs.c

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/drivers/scsi/
H A Dscsi_sysctl.cdiff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/drivers/macintosh/
H A Dmac_hid.cdiff e99f5e74 Fri Jan 21 23:11:44 MST 2022 Luis Chamberlain <mcgrof@kernel.org> macintosh/mac_hid.c: simplify subdirectory registration with register_sysctl()

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci PATH

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL
Link: https://lkml.kernel.org/r/20211123202422.819032-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff e99f5e74 Fri Jan 21 23:11:44 MST 2022 Luis Chamberlain <mcgrof@kernel.org> macintosh/mac_hid.c: simplify subdirectory registration with register_sysctl()

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci PATH

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL
Link: https://lkml.kernel.org/r/20211123202422.819032-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff e99f5e74 Fri Jan 21 23:11:44 MST 2022 Luis Chamberlain <mcgrof@kernel.org> macintosh/mac_hid.c: simplify subdirectory registration with register_sysctl()

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci PATH

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL
Link: https://lkml.kernel.org/r/20211123202422.819032-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff e99f5e74 Fri Jan 21 23:11:44 MST 2022 Luis Chamberlain <mcgrof@kernel.org> macintosh/mac_hid.c: simplify subdirectory registration with register_sysctl()

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci PATH

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL
Link: https://lkml.kernel.org/r/20211123202422.819032-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0a4b9b6e Wed Feb 14 01:33:32 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: mac_hid: remove unnecessary insert_at_head flag

With unique sysctl binary numbers setting insert_at_head is pointless.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/drivers/char/ipmi/
H A Dipmi_poweroff.cdiff 0e6b9e8c Sat Aug 22 17:46:55 MDT 2009 Peter Huewe <peterhuewe@gmx.de> trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c

Trivial patch which adds the __init to the module_init function of
drivers/char/ipmi/ipmy_poweroff.c and corrects the location of __exit for the
cleanup function.

Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Acked-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0c8204b3 Sun Dec 10 03:19:06 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] IPMI: remove zero inits

Remove all =0 and =NULL from static initializers. They are not needed and
removing them saves space in the object files.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff 0c8204b3 Sun Dec 10 03:19:06 MST 2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] IPMI: remove zero inits

Remove all =0 and =NULL from static initializers. They are not needed and
removing them saves space in the object files.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/linux-master/drivers/cdrom/
H A Dcdrom.cdiff ad8f7431 Fri Jan 21 23:12:04 MST 2022 Luis Chamberlain <mcgrof@kernel.org> cdrom: simplify subdirectory registration with register_sysctl()

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci PATH

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL
Link: https://lkml.kernel.org/r/20211123202422.819032-8-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff ad8f7431 Fri Jan 21 23:12:04 MST 2022 Luis Chamberlain <mcgrof@kernel.org> cdrom: simplify subdirectory registration with register_sysctl()

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci PATH

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL
Link: https://lkml.kernel.org/r/20211123202422.819032-8-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff ad8f7431 Fri Jan 21 23:12:04 MST 2022 Luis Chamberlain <mcgrof@kernel.org> cdrom: simplify subdirectory registration with register_sysctl()

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci PATH

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL
Link: https://lkml.kernel.org/r/20211123202422.819032-8-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff ad8f7431 Fri Jan 21 23:12:04 MST 2022 Luis Chamberlain <mcgrof@kernel.org> cdrom: simplify subdirectory registration with register_sysctl()

There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.

// pycocci sysctl-subdir-register-sysctl-simplify.cocci PATH

@c1@
expression E1;
identifier subdir, sysctls;
@@

static struct ctl_table subdir[] = {
{
.procname = E1,
.maxlen = 0,
.mode = 0555,
.child = sysctls,
},
{ }
};

@c2@
identifier c1.subdir;

expression E2;
identifier base;
@@

static struct ctl_table base[] = {
{
.procname = E2,
.maxlen = 0,
.mode = 0555,
.child = subdir,
},
{ }
};

@c3@
identifier c2.base;
identifier header;
@@

header = register_sysctl_table(base);

@r1 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.subdir, c1.sysctls;
@@

-static struct ctl_table subdir[] = {
- {
- .procname = E1,
- .maxlen = 0,
- .mode = 0555,
- .child = sysctls,
- },
- { }
-};

@r2 depends on c1 && c2 && c3@
identifier c1.subdir;

expression c2.E2;
identifier c2.base;
@@
-static struct ctl_table base[] = {
- {
- .procname = E2,
- .maxlen = 0,
- .mode = 0555,
- .child = subdir,
- },
- { }
-};

@initialize:python@
@@

def make_my_fresh_expression(s1, s2):
return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'

@r3 depends on c1 && c2 && c3@
expression c1.E1;
identifier c1.sysctls;
expression c2.E2;
identifier c2.base;
identifier c3.header;
fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
@@

header =
-register_sysctl_table(base);
+register_sysctl(E3, sysctls);

Generated-by: Coccinelle SmPL
Link: https://lkml.kernel.org/r/20211123202422.819032-8-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 67f1e027 Mon Sep 13 17:09:42 MDT 2021 Lukas Prediger <lumip@lumip.de> drivers/cdrom: improved ioctl for media change detection

The current implementation of the CDROM_MEDIA_CHANGED ioctl relies on
global state, meaning that only one process can detect a disc change
while the ioctl call will return 0 for other calling processes afterwards
(see bug 213267).

This introduces a new cdrom ioctl, CDROM_TIMED_MEDIA_CHANGE, that
works by maintaining a timestamp of the last detected disc change instead
of a boolean flag: Processes calling this ioctl command can provide
a timestamp of the last disc change known to them and receive
an indication whether the disc was changed since then and the updated
timestamp.

I considered fixing the buggy behavior in the original
CDROM_MEDIA_CHANGED ioctl but that would require maintaining state
for each calling process in the kernel, which seems like a worse
solution than introducing this new ioctl.

Signed-off-by: Lukas Prediger <lumip@lumip.de>
Link: https://lore.kernel.org/all/20210912191207.74449-1-lumip@lumip.de
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Link: https://lore.kernel.org/r/20210913230942.1188-1-phil@philpotter.co.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff f25191bb Wed Feb 06 22:13:49 MST 2019 Guenter Roeck <linux@roeck-us.net> cdrom: Fix race condition in cdrom_sysctl_register

The following traceback is sometimes seen when booting an image in qemu:

[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50

The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.

Use a static atomic to ensure that the function is executed exactly once.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
/linux-master/arch/s390/mm/
H A Dcmm.cdiff b8e51a6a Thu Oct 24 15:31:48 MDT 2019 Yihui ZENG <yzeng56@asu.edu> s390/cmm: fix information leak in cmm_timeout_handler()

The problem is that we were putting the NUL terminator too far:

buf[sizeof(buf) - 1] = '\0';

If the user input isn't NUL terminated and they haven't initialized the
whole buffer then it leads to an info leak. The NUL terminator should
be:

buf[len - 1] = '\0';

Signed-off-by: Yihui Zeng <yzeng56@asu.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[heiko.carstens@de.ibm.com: keep semantics of how *lenp and *ppos are handled]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
diff b8e51a6a Thu Oct 24 15:31:48 MDT 2019 Yihui ZENG <yzeng56@asu.edu> s390/cmm: fix information leak in cmm_timeout_handler()

The problem is that we were putting the NUL terminator too far:

buf[sizeof(buf) - 1] = '\0';

If the user input isn't NUL terminated and they haven't initialized the
whole buffer then it leads to an info leak. The NUL terminator should
be:

buf[len - 1] = '\0';

Signed-off-by: Yihui Zeng <yzeng56@asu.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[heiko.carstens@de.ibm.com: keep semantics of how *lenp and *ppos are handled]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
diff 1d27e3e2 Wed Oct 04 17:27:04 MDT 2017 Kees Cook <keescook@chromium.org> timer: Remove expires and data arguments from DEFINE_TIMER

Drop the arguments from the macro and adjust all callers with the
following script:

perl -pi -e 's/DEFINE_TIMER\((.*), 0, 0\);/DEFINE_TIMER($1);/g;' \
$(git grep DEFINE_TIMER | cut -d: -f1 | sort -u | grep -v timer.h)

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # for m68k parts
Acked-by: Guenter Roeck <linux@roeck-us.net> # for watchdog parts
Acked-by: David S. Miller <davem@davemloft.net> # for networking parts
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Kalle Valo <kvalo@codeaurora.org> # for wireless parts
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: linux-mips@linux-mips.org
Cc: Petr Mladek <pmladek@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Kalle Valo <kvalo@qca.qualcomm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux1394-devel@lists.sourceforge.net
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: linux-s390@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Harish Patil <harish.patil@cavium.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Michael Reed <mdr@sgi.com>
Cc: Manish Chopra <manish.chopra@cavium.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-pm@vger.kernel.org
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mark Gross <mark.gross@intel.com>
Cc: linux-watchdog@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: netdev@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lkml.kernel.org/r/1507159627-127660-11-git-send-email-keescook@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
diff 1d27e3e2 Wed Oct 04 17:27:04 MDT 2017 Kees Cook <keescook@chromium.org> timer: Remove expires and data arguments from DEFINE_TIMER

Drop the arguments from the macro and adjust all callers with the
following script:

perl -pi -e 's/DEFINE_TIMER\((.*), 0, 0\);/DEFINE_TIMER($1);/g;' \
$(git grep DEFINE_TIMER | cut -d: -f1 | sort -u | grep -v timer.h)

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # for m68k parts
Acked-by: Guenter Roeck <linux@roeck-us.net> # for watchdog parts
Acked-by: David S. Miller <davem@davemloft.net> # for networking parts
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Kalle Valo <kvalo@codeaurora.org> # for wireless parts
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: linux-mips@linux-mips.org
Cc: Petr Mladek <pmladek@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Kalle Valo <kvalo@qca.qualcomm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux1394-devel@lists.sourceforge.net
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: linux-s390@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Harish Patil <harish.patil@cavium.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Michael Reed <mdr@sgi.com>
Cc: Manish Chopra <manish.chopra@cavium.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-pm@vger.kernel.org
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mark Gross <mark.gross@intel.com>
Cc: linux-watchdog@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: netdev@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lkml.kernel.org/r/1507159627-127660-11-git-send-email-keescook@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
diff 0a87c5cf Wed Aug 22 05:51:40 MDT 2007 Michael Holzheu <holzheu@de.ibm.com> [S390] vmur: fix diag14 exceptions with addresses > 2GB.

There are several s390 diagnose calls, which must be executed below the
2GB memory boundary. In order to enforce this, those diagnoses must be
compiled into the kernel. Currently diag 14 can be called within the
vmur kernel module from addresses above 2GB. This leads to specification
exceptions. This patch moves diag10, diag14 and diag210 into the new
diag.c file.

Signed-off-by: Michael Holzheu <holzheu@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/arch/s390/kernel/
H A Ddebug.cdiff d72541f9 Fri Aug 13 07:05:04 MDT 2021 Peter Oberparleiter <oberpar@linux.ibm.com> s390/debug: add early tracing support

Debug areas can currently only be used after s390dbf initialization
which occurs as a postcore_initcall. This is too late for tracing
earlier code such as that related to console_init().

This patch introduces a macro for defining a statically initialized
debug area that can be used to trace very early code. The macro is made
available for built-in code only because modules are never running
during early boot.

Example usage:

1. Define static debug area:

DEFINE_STATIC_DEBUG_INFO(my_debug, "my_debug", 4, 1, 16,
&debug_hex_ascii_view);

2. Add trace entry:

debug_event(&my_debug, 0, "DATA", 4);

Note: The debug area is automatically registered in debugfs during boot.
A driver must not call any of the debug_register()/_unregister()
functions on a static debug_info_t!

Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
diff 860ec7c6 Thu Nov 23 11:26:20 MST 2017 Christian Borntraeger <borntraeger@de.ibm.com> s390/debug: use pK for kernel pointers

the s390 debug feature (/sys/kernel/debug/s390dbf/) shows the kernel
pointer of the calling function even for kptr_restrict == 2. Let us
use pK instead of p.

This hides the kernel addresses for kptr_restrict == 2:

root@host $ echo 2 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 0000000000000000 snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 1 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 0 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 860ec7c6 Thu Nov 23 11:26:20 MST 2017 Christian Borntraeger <borntraeger@de.ibm.com> s390/debug: use pK for kernel pointers

the s390 debug feature (/sys/kernel/debug/s390dbf/) shows the kernel
pointer of the calling function even for kptr_restrict == 2. Let us
use pK instead of p.

This hides the kernel addresses for kptr_restrict == 2:

root@host $ echo 2 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 0000000000000000 snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 1 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 0 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 860ec7c6 Thu Nov 23 11:26:20 MST 2017 Christian Borntraeger <borntraeger@de.ibm.com> s390/debug: use pK for kernel pointers

the s390 debug feature (/sys/kernel/debug/s390dbf/) shows the kernel
pointer of the calling function even for kptr_restrict == 2. Let us
use pK instead of p.

This hides the kernel addresses for kptr_restrict == 2:

root@host $ echo 2 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 0000000000000000 snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 1 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 0 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 860ec7c6 Thu Nov 23 11:26:20 MST 2017 Christian Borntraeger <borntraeger@de.ibm.com> s390/debug: use pK for kernel pointers

the s390 debug feature (/sys/kernel/debug/s390dbf/) shows the kernel
pointer of the calling function even for kptr_restrict == 2. Let us
use pK instead of p.

This hides the kernel addresses for kptr_restrict == 2:

root@host $ echo 2 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 0000000000000000 snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 1 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 0 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 860ec7c6 Thu Nov 23 11:26:20 MST 2017 Christian Borntraeger <borntraeger@de.ibm.com> s390/debug: use pK for kernel pointers

the s390 debug feature (/sys/kernel/debug/s390dbf/) shows the kernel
pointer of the calling function even for kptr_restrict == 2. Let us
use pK instead of p.

This hides the kernel addresses for kptr_restrict == 2:

root@host $ echo 2 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 0000000000000000 snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 1 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 0 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 860ec7c6 Thu Nov 23 11:26:20 MST 2017 Christian Borntraeger <borntraeger@de.ibm.com> s390/debug: use pK for kernel pointers

the s390 debug feature (/sys/kernel/debug/s390dbf/) shows the kernel
pointer of the calling function even for kptr_restrict == 2. Let us
use pK instead of p.

This hides the kernel addresses for kptr_restrict == 2:

root@host $ echo 2 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 0000000000000000 snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 1 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 0 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 860ec7c6 Thu Nov 23 11:26:20 MST 2017 Christian Borntraeger <borntraeger@de.ibm.com> s390/debug: use pK for kernel pointers

the s390 debug feature (/sys/kernel/debug/s390dbf/) shows the kernel
pointer of the calling function even for kptr_restrict == 2. Let us
use pK instead of p.

This hides the kernel addresses for kptr_restrict == 2:

root@host $ echo 2 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 0000000000000000 snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 1 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

root@host $ echo 0 > /proc/sys/kernel/kptr_restrict
root@host $ tail -n1 /sys/kernel/debug/s390dbf/cio_msg/sprintf
00 01511461280:386645 2 - 00 000000000071171c snsid: device 0.0.3f68: rc=0 3990/e9 3390/0c

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 0dcd91a9 Thu Oct 12 05:57:26 MDT 2017 Sebastian Ott <sebott@linux.vnet.ibm.com> s390/debug: only write data once

debug_event_common memsets the active debug entry with zeros to
prevent stale data leakage. This is overwritten with the actual
debug data in the next step. Only write zeros to that part of the
debug entry that's not used by new debug data.

Micro benchmarks show a 2-10% reduction of cpu cycles with this
approach.

Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 5a334c08 Thu Dec 20 09:30:52 MST 2012 Michael Holzheu <holzheu@linux.vnet.ibm.com> s390/debug: Fix s390dbf lockdep problem in debug_(un)register_view()

The debug_register/unregister_view() functions call debugfs_remove()
while holding the debug_info spinlock. Because debugfs_remove() takes
a mutex and therefore can sleep this is not allowed. To fix the problem
we give up the debug_info lock before calling debugfs_remove().

The following shows the lockdep message:

[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
rmmod/4379 is trying to acquire lock:
(&sb->s_type->i_mutex_key#2){+.+.+.}, at: [<00000000003acae2>] debugfs_remove+0x5e/0xa

but task is already holding lock:
(&(&rc->lock)->rlock){-.-...}, at: [<000000000010a5ae>] debug_unregister_view+0x3a/0xd

which lock already depends on the new lock.

-> #0 (&sb->s_type->i_mutex_key#2){+.+.+.}:
[<00000000001b1644>] validate_chain+0x880/0x1154
[<00000000001b4d6c>] __lock_acquire+0x414/0xc44
[<00000000001b5c16>] lock_acquire+0xbe/0x178
[<0000000000614016>] mutex_lock_nested+0x66/0x36c
[<00000000003acae2>] debugfs_remove+0x5e/0xac
[<000000000010a620>] debug_unregister_view+0xac/0xd0
[<000003ff8002f140>] qeth_core_exit+0x48/0xf08 [qeth]
[<00000000001c35a4>] SyS_delete_module+0x1a4/0x260
[<0000000000618134>] sysc_noemu+0x22/0x28
[<000003fffd4704da>] 0x3fffd4704da

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
/linux-master/arch/s390/appldata/
H A Dappldata_base.cdiff 7db12246 Fri Mar 10 16:45:25 MST 2023 Luis Chamberlain <mcgrof@kernel.org> s390: simplify dynamic sysctl registration for appldata_register_ops

The routine appldata_register_ops() allocates a sysctl table
with 4 entries. The firsts one, ops->ctl_table[0] is the parent directory
with an empty entry following it, ops->ctl_table[1]. The next entry is
for the ops->name and that is ops->ctl_table[2]. It needs an empty
entry following that, and that is ops->ctl_table[3]. And so hence the
kcalloc(4, sizeof(struct ctl_table), GFP_KERNEL).

We can simplify this considerably since sysctl_register("foo", table)
can create the parent directory for us if it does not exist. So we
can just remove the first two entries and move back the ops->name to
the first entry, and just use kcalloc(2, ...).

[gor@linux.ibm.com: appldata_generic_handler fixup ctl_table index 2->0]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230310234525.3986352-7-mcgrof@kernel.org
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
diff 7db12246 Fri Mar 10 16:45:25 MST 2023 Luis Chamberlain <mcgrof@kernel.org> s390: simplify dynamic sysctl registration for appldata_register_ops

The routine appldata_register_ops() allocates a sysctl table
with 4 entries. The firsts one, ops->ctl_table[0] is the parent directory
with an empty entry following it, ops->ctl_table[1]. The next entry is
for the ops->name and that is ops->ctl_table[2]. It needs an empty
entry following that, and that is ops->ctl_table[3]. And so hence the
kcalloc(4, sizeof(struct ctl_table), GFP_KERNEL).

We can simplify this considerably since sysctl_register("foo", table)
can create the parent directory for us if it does not exist. So we
can just remove the first two entries and move back the ops->name to
the first entry, and just use kcalloc(2, ...).

[gor@linux.ibm.com: appldata_generic_handler fixup ctl_table index 2->0]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230310234525.3986352-7-mcgrof@kernel.org
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
diff eec4844f Thu Jul 18 16:58:50 MDT 2019 Matteo Croce <mcroce@redhat.com> proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff eec4844f Thu Jul 18 16:58:50 MDT 2019 Matteo Croce <mcroce@redhat.com> proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff eec4844f Thu Jul 18 16:58:50 MDT 2019 Matteo Croce <mcroce@redhat.com> proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff f6ea4d25 Mon Jun 18 08:50:42 MDT 2018 Vasily Gorbik <gor@linux.ibm.com> s390: fix gcc 8 stringop-truncation warnings in proc handlers

arch/s390/kernel/topology.c:591:3: warning: 'strncpy' output truncated
before terminating nul copying 2 bytes from a string of the same length
[-Wstringop-truncation]
strncpy(buf, topology_is_enabled() ? "1\n" : "0\n", ARRAY_SIZE(buf));

arch/s390/appldata/appldata_base.c:326:3: warning: 'strncpy' output truncated
before terminating nul copying 2 bytes from a string of the same length
[-Wstringop-truncation]
strncpy(buf, ops->active ? "1\n" : "0\n", ARRAY_SIZE(buf));

arch/s390/appldata/appldata_base.c:217:3: warning: 'strncpy' output truncated
before terminating nul copying 2 bytes from a string of the same length
[-Wstringop-truncation]
strncpy(buf, appldata_timer_active ? "1\n" : "0\n", ARRAY_SIZE(buf));

To avoid the warning, just reuse memcpy.

Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff f6ea4d25 Mon Jun 18 08:50:42 MDT 2018 Vasily Gorbik <gor@linux.ibm.com> s390: fix gcc 8 stringop-truncation warnings in proc handlers

arch/s390/kernel/topology.c:591:3: warning: 'strncpy' output truncated
before terminating nul copying 2 bytes from a string of the same length
[-Wstringop-truncation]
strncpy(buf, topology_is_enabled() ? "1\n" : "0\n", ARRAY_SIZE(buf));

arch/s390/appldata/appldata_base.c:326:3: warning: 'strncpy' output truncated
before terminating nul copying 2 bytes from a string of the same length
[-Wstringop-truncation]
strncpy(buf, ops->active ? "1\n" : "0\n", ARRAY_SIZE(buf));

arch/s390/appldata/appldata_base.c:217:3: warning: 'strncpy' output truncated
before terminating nul copying 2 bytes from a string of the same length
[-Wstringop-truncation]
strncpy(buf, appldata_timer_active ? "1\n" : "0\n", ARRAY_SIZE(buf));

To avoid the warning, just reuse memcpy.

Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff f6ea4d25 Mon Jun 18 08:50:42 MDT 2018 Vasily Gorbik <gor@linux.ibm.com> s390: fix gcc 8 stringop-truncation warnings in proc handlers

arch/s390/kernel/topology.c:591:3: warning: 'strncpy' output truncated
before terminating nul copying 2 bytes from a string of the same length
[-Wstringop-truncation]
strncpy(buf, topology_is_enabled() ? "1\n" : "0\n", ARRAY_SIZE(buf));

arch/s390/appldata/appldata_base.c:326:3: warning: 'strncpy' output truncated
before terminating nul copying 2 bytes from a string of the same length
[-Wstringop-truncation]
strncpy(buf, ops->active ? "1\n" : "0\n", ARRAY_SIZE(buf));

arch/s390/appldata/appldata_base.c:217:3: warning: 'strncpy' output truncated
before terminating nul copying 2 bytes from a string of the same length
[-Wstringop-truncation]
strncpy(buf, appldata_timer_active ? "1\n" : "0\n", ARRAY_SIZE(buf));

To avoid the warning, just reuse memcpy.

Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff 0caa8cdf Fri Nov 24 07:00:37 MST 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> s390: appldata: add SPDX identifiers to the remaining files

It's good to have SPDX identifiers in all files to make it easier to
audit the kernel tree for correct licenses.

Update the arch/s390/appldata/ files with the correct SPDX license
identifier based on the license text in the file itself. The SPDX
identifier is a legally binding shorthand, which can be used instead of
the full boiler plate text.

This work is based on a script and data from Thomas Gleixner, Philippe
Ombredanne, and Kate Stewart.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diff d5b4c2f4 Mon May 27 03:55:13 MDT 2013 Chen Gang <gang.chen@asianux.com> arch: s390: appldata: using strncpy() and strnlen() instead of sprintf()

'buf[2]' is 2 bytes length, and sprintf() will append '\0' at the end
of string "?\n", so original implementation is memory overflow.

Need use strncpy() and strnlen() instead of sprintf().

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
/linux-master/arch/powerpc/kernel/
H A Didle.cdiff b793a010 Mon May 02 01:29:25 MDT 2022 Randy Dunlap <rdunlap@infradead.org> powerpc/idle: Fix return value of __setup() handler

__setup() handlers should return 1 to obsolete_checksetup() in
init/main.c to indicate that the boot option has been handled.

A return of 0 causes the boot option/value to be listed as an Unknown
kernel parameter and added to init's (limited) argument or environment
strings.

Also, error return codes don't mean anything to obsolete_checksetup() --
only non-zero (usually 1) or zero. So return 1 from powersave_off().

Fixes: 302eca184fb8 ("[POWERPC] cell: use ppc_md->power_save instead of cbe_idle_loop")
Reported-by: Igor Zhbanov <izh1979@gmail.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220502192925.19954-1-rdunlap@infradead.org
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
diff a7c2bb82 Wed Aug 04 12:28:33 MDT 2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com> powerpc: Re-enable preemption before cpu_die()

start_secondary() is called shortly after _start and also via

cpu_idle()->cpu_die()->pseries_mach_cpu_die()

start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is
called via the cpu_idle() routine with preemption disabled, resulting in the
following repeating message during rapid cpu offline/online tests
with CONFIG_PREEMPT=y:

BUG: scheduling while atomic: swapper/0/0x00000002
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e7079c0] [c0000000000133ec] .show_stack+0xd8/0x218 (unreliable)
[c00000010e707aa0] [c0000000006a47f0] .dump_stack+0x28/0x3c
[c00000010e707b20] [c00000000006e7a4] .__schedule_bug+0x7c/0x9c
[c00000010e707bb0] [c000000000699d9c] .schedule+0x104/0x800
[c00000010e707cd0] [c000000000015b24] .cpu_idle+0x1c4/0x1d8
[c00000010e707d70] [c0000000006aa1b4] .start_secondary+0x398/0x3d4
[c00000010e707e30] [c000000000008278] .start_secondary_resume+0x10/0x14

Move the cpu_die() call inside the existing preemption enabled block of
cpu_idle(). This is safe as the idle task is affined to a single CPU so the
debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are
in a "migration disabled" region.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
/linux-master/arch/arm/kernel/
H A Disa.cdiff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/net/unix/
H A Dsysctl_net_unix.cdiff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0b4d4147 Wed Feb 14 01:34:09 MST 2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Completed in 1874 milliseconds

12