Cross Reference: /linux-master/net/xdp/xsk.c

History log of /linux-master/net/xdp/xsk.c
Revision	Date	Author	Comments
# 237f3cf1	04-Apr-2024	Eric Dumazet <edumazet@google.com>	xsk: validate user input for XDP_{UMEM\|COMPLETION}_FILL_RING syzbot reported an illegal copy in xsk_setsockopt() [1] Make sure to validate setsockopt() @optlen parameter. [1] BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline] BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline] BUG: KASAN: slab-out-of-bounds in xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420 Read of size 4 at addr ffff888028c6cde3 by task syz-executor.0/7549 CPU: 0 PID: 7549 Comm: syz-executor.0 Not tainted 6.8.0-syzkaller-08951-gfe46a7dd189e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 copy_from_sockptr_offset include/linux/sockptr.h:49 [inline] copy_from_sockptr include/linux/sockptr.h:55 [inline] xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420 do_sock_setsockopt+0x3af/0x720 net/socket.c:2311 __sys_setsockopt+0x1ae/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6d/0x75 RIP: 0033:0x7fb40587de69 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fb40665a0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007fb4059abf80 RCX: 00007fb40587de69 RDX: 0000000000000005 RSI: 000000000000011b RDI: 0000000000000006 RBP: 00007fb4058ca47a R08: 0000000000000002 R09: 0000000000000000 R10: 0000000020001980 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fb4059abf80 R15: 00007fff57ee4d08 </TASK> Allocated by task 7549: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:370 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387 kasan_kmalloc include/linux/kasan.h:211 [inline] __do_kmalloc_node mm/slub.c:3966 [inline] __kmalloc+0x233/0x4a0 mm/slub.c:3979 kmalloc include/linux/slab.h:632 [inline] __cgroup_bpf_run_filter_setsockopt+0xd2f/0x1040 kernel/bpf/cgroup.c:1869 do_sock_setsockopt+0x6b4/0x720 net/socket.c:2293 __sys_setsockopt+0x1ae/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6d/0x75 The buggy address belongs to the object at ffff888028c6cde0 which belongs to the cache kmalloc-8 of size 8 The buggy address is located 1 bytes to the right of allocated 2-byte region [ffff888028c6cde0, ffff888028c6cde2) The buggy address belongs to the physical page: page:ffffea0000a31b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888028c6c9c0 pfn:0x28c6c anon flags: 0xfff00000000800(slab\|node=0\|zone=1\|lastcpupid=0x7ff) page_type: 0xffffffff() raw: 00fff00000000800 ffff888014c41280 0000000000000000 dead000000000001 raw: ffff888028c6c9c0 0000000080800057 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112cc0(GFP_USER\|__GFP_NOWARN\|__GFP_NORETRY), pid 6648, tgid 6644 (syz-executor.0), ts 133906047828, free_ts 133859922223 set_page_owner include/linux/page_owner.h:31 [inline] post_alloc_hook+0x1ea/0x210 mm/page_alloc.c:1533 prep_new_page mm/page_alloc.c:1540 [inline] get_page_from_freelist+0x33ea/0x3580 mm/page_alloc.c:3311 __alloc_pages+0x256/0x680 mm/page_alloc.c:4569 __alloc_pages_node include/linux/gfp.h:238 [inline] alloc_pages_node include/linux/gfp.h:261 [inline] alloc_slab_page+0x5f/0x160 mm/slub.c:2175 allocate_slab mm/slub.c:2338 [inline] new_slab+0x84/0x2f0 mm/slub.c:2391 ___slab_alloc+0xc73/0x1260 mm/slub.c:3525 __slab_alloc mm/slub.c:3610 [inline] __slab_alloc_node mm/slub.c:3663 [inline] slab_alloc_node mm/slub.c:3835 [inline] __do_kmalloc_node mm/slub.c:3965 [inline] __kmalloc_node+0x2db/0x4e0 mm/slub.c:3973 kmalloc_node include/linux/slab.h:648 [inline] __vmalloc_area_node mm/vmalloc.c:3197 [inline] __vmalloc_node_range+0x5f9/0x14a0 mm/vmalloc.c:3392 __vmalloc_node mm/vmalloc.c:3457 [inline] vzalloc+0x79/0x90 mm/vmalloc.c:3530 bpf_check+0x260/0x19010 kernel/bpf/verifier.c:21162 bpf_prog_load+0x1667/0x20f0 kernel/bpf/syscall.c:2895 __sys_bpf+0x4ee/0x810 kernel/bpf/syscall.c:5631 __do_sys_bpf kernel/bpf/syscall.c:5738 [inline] __se_sys_bpf kernel/bpf/syscall.c:5736 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6d/0x75 page last free pid 6650 tgid 6647 stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1140 [inline] free_unref_page_prepare+0x95d/0xa80 mm/page_alloc.c:2346 free_unref_page_list+0x5a3/0x850 mm/page_alloc.c:2532 release_pages+0x2117/0x2400 mm/swap.c:1042 tlb_batch_pages_flush mm/mmu_gather.c:98 [inline] tlb_flush_mmu_free mm/mmu_gather.c:293 [inline] tlb_flush_mmu+0x34d/0x4e0 mm/mmu_gather.c:300 tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:392 exit_mmap+0x4b6/0xd40 mm/mmap.c:3300 __mmput+0x115/0x3c0 kernel/fork.c:1345 exit_mm+0x220/0x310 kernel/exit.c:569 do_exit+0x99e/0x27e0 kernel/exit.c:865 do_group_exit+0x207/0x2c0 kernel/exit.c:1027 get_signal+0x176e/0x1850 kernel/signal.c:2907 arch_do_signal_or_restart+0x96/0x860 arch/x86/kernel/signal.c:310 exit_to_user_mode_loop kernel/entry/common.c:105 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:201 [inline] syscall_exit_to_user_mode+0xc9/0x360 kernel/entry/common.c:212 do_syscall_64+0x10a/0x240 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Memory state around the buggy address: ffff888028c6cc80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc ffff888028c6cd00: fa fc fc fc fa fc fc fc 00 fc fc fc 06 fc fc fc >ffff888028c6cd80: fa fc fc fc fa fc fc fc fa fc fc fc 02 fc fc fc ^ ffff888028c6ce00: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc ffff888028c6ce80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc Fixes: 423f38329d26 ("xsk: add umem fill queue support and mmap") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: "Björn Töpel" <bjorn@kernel.org> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20240404202738.3634547-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
# 2863d665	05-Feb-2024	Magnus Karlsson <magnus.karlsson@intel.com>	xsk: support redirect to any socket bound to the same umem Add support for directing a packet to any socket bound to the same umem. This makes it possible to use the XDP program to select what socket the packet should be received on. The user can populate the XSKMAP with various sockets and as long as they share the same umem, the XDP program can pick any one of them. Suggested-by: Yuval El-Hanany <yuvale@radware.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240205123553.22180-2-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 2127c604	02-Feb-2024	Sebastian Andrzej Siewior <bigeasy@linutronix.de>	xsk: Add truesize to skb_add_rx_frag(). xsk_build_skb() allocates a page and adds it to the skb via skb_add_rx_frag() and specifies 0 for truesize. This leads to a warning in skb_add_rx_frag() with CONFIG_DEBUG_NET enabled because size is larger than truesize. Increasing truesize requires to add the same amount to socket's sk_wmem_alloc counter in order not to underflow the counter during release in the destructor (sock_wfree()). Pass the size of the allocated page as truesize to skb_add_rx_frag(). Add this mount to socket's sk_wmem_alloc counter. Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20240202163221.2488589-1-bigeasy@linutronix.de
# 26900989	24-Jan-2024	Maciej Fijalkowski <maciej.fijalkowski@intel.com>	xsk: recycle buffer in case Rx queue was full Add missing xsk_buff_free() call when __xsk_rcv_zc() failed to produce descriptor to XSK Rx queue. Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-2-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 11614723	27-Nov-2023	Stanislav Fomichev <sdf@google.com>	xsk: Add option to calculate TX checksum in SW For XDP_COPY mode, add a UMEM option XDP_UMEM_TX_SW_CSUM to call skb_checksum_help in transmit path. Might be useful to debugging issues with real hardware. I also use this mode in the selftests. Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231127190319.1190813-9-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# ce59f968	27-Nov-2023	Stanislav Fomichev <sdf@google.com>	xsk: Validate xsk_tx_metadata flags Accept only the flags that the kernel knows about to make sure we can extend this field in the future. Note that only in XDP_COPY mode we propagate the error signal back to the user (via sendmsg). For zerocopy mode we silently skip the metadata for the descriptors that have wrong flags (since we process the descriptors deep in the driver). Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231127190319.1190813-8-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 48eb03dd	27-Nov-2023	Stanislav Fomichev <sdf@google.com>	xsk: Add TX timestamp and TX checksum offload support This change actually defines the (initial) metadata layout that should be used by AF_XDP userspace (xsk_tx_metadata). The first field is flags which requests appropriate offloads, followed by the offload-specific fields. The supported per-device offloads are exported via netlink (new xsk-flags). The offloads themselves are still implemented in a bit of a framework-y fashion that's left from my initial kfunc attempt. I'm introducing new xsk_tx_metadata_ops which drivers are supposed to implement. The drivers are also supposed to call xsk_tx_metadata_request/xsk_tx_metadata_complete in the right places. Since xsk_tx_metadata_{request,_complete} are static inline, we don't incur any extra overhead doing indirect calls. The benefit of this scheme is as follows: - keeps all metadata layout parsing away from driver code - makes it easy to grep and see which drivers implement what - don't need any extra flags to maintain to keep track of what offloads are implemented; if the callback is implemented - the offload is supported (used by netlink reporting code) Two offloads are defined right now: 1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset 2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata area upon completion (tx_timestamp field) XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass metadata pointer). The struct is forward-compatible and can be extended in the future by appending more fields. Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231127190319.1190813-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 341ac980	27-Nov-2023	Stanislav Fomichev <sdf@google.com>	xsk: Support tx_metadata_len For zerocopy mode, tx_desc->addr can point to an arbitrary offset and carry some TX metadata in the headroom. For copy mode, there is no way currently to populate skb metadata. Introduce new tx_metadata_len umem config option that indicates how many bytes to treat as metadata. Metadata bytes come prior to tx_desc address (same as in RX case). The size of the metadata has mostly the same constraints as XDP: - less than 256 bytes - 8-byte aligned (compared to 4-byte alignment on xdp, due to 8-byte timestamp in the completion) - non-zero This data is not interpreted in any way right now. Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231127190319.1190813-2-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# e4d008d4	30-Nov-2023	Yewon Choi <woni9911@gmail.com>	xsk: Skip polling event check for unbound socket In xsk_poll(), checking available events and setting mask bits should be executed only when a socket has been bound. Setting mask bits for unbound socket is meaningless. Currently, it checks events even when xsk_check_common() failed. To prevent this, we move goto location (skip_tx) after that checking. Fixes: 1596dae2f17e ("xsk: check IFF_UP earlier in Tx path") Signed-off-by: Yewon Choi <woni9911@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20231201061048.GA1510@libra05
# 99b29a49	23-Oct-2023	Albert Huang <huangjie.albert@bytedance.com>	xsk: Avoid starving the xsk further down the list In the previous implementation, when multiple xsk sockets were associated with a single xsk_buff_pool, a situation could arise where the xsk_tx_list maintained data at the front for one xsk socket while starving the xsk sockets at the back of the list. This could result in issues such as the inability to transmit packets, increased latency, and jitter. To address this problem, we introduce a new variable called tx_budget_spent, which limits each xsk to transmit a maximum of MAX_PER_SOCKET_BUDGET tx descriptors. This allocation ensures equitable opportunities for subsequent xsk sockets to send tx descriptors. The value of MAX_PER_SOCKET_BUDGET is set to 32. Signed-off-by: Albert Huang <huangjie.albert@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20231023125732.82261-1-huangjie.albert@bytedance.com
# 9a675ba5	16-Oct-2023	Sebastian Andrzej Siewior <bigeasy@linutronix.de>	net, bpf: Add a warning if NAPI cb missed xdp_do_flush(). A few drivers were missing a xdp_do_flush() invocation after XDP_REDIRECT. Add three helper functions each for one of the per-CPU lists. Return true if the per-CPU list is non-empty and flush the list. Add xdp_do_check_flushed() which invokes each helper functions and creates a warning if one of the functions had a non-empty list. Hide everything behind CONFIG_DEBUG_NET. Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20231016125738.Yt79p1uF@linutronix.de
# 10bbf165	21-Sep-2023	Eric Dumazet <edumazet@google.com>	net: implement lockless SO_PRIORITY This is a followup of 8bf43be799d4 ("net: annotate data-races around sk->sk_priority"). sk->sk_priority can be read and written without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
# d609f3d2	06-Sep-2023	Tirthendu Sarkar <tirthendu.sarkar@intel.com>	xsk: add multi-buffer support for sockets sharing umem Userspace applications indicate their multi-buffer capability to xsk using XSK_USE_SG socket bind flag. For sockets using shared umem the bind flag may contain XSK_USE_SG only for the first socket. For any subsequent socket the only option supported is XDP_SHARED_UMEM. Add option XDP_UMEM_SG_FLAG in umem config flags to store the multi-buffer handling capability when indicated by XSK_USE_SG option in bing flag by the first socket. Use this to derive multi-buffer capability for subsequent sockets in xsk core. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Fixes: 81470b5c3c66 ("xsk: introduce XSK_USE_SG bind flag for xsk socket") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230907035032.2627879-1-tirthendu.sarkar@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 9d0a67b9	23-Aug-2023	Tirthendu Sarkar <tirthendu.sarkar@intel.com>	xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR() Currently, xsk_build_skb() is a function that builds skb in two possible ways and then is ended with common error handling. We can distinguish four possible error paths and handling in xsk_build_skb(): 1. sock_alloc_send_skb fails: Retry (skb is NULL). 2. skb_store_bits fails : Free skb and retry. 3. MAX_SKB_FRAGS exceeded: Free skb, cleanup and drop packet. 4. alloc_page fails for frag: Retry page allocation w/o freeing skb 1] and 3] can happen in xsk_build_skb_zerocopy(), which is one of the two code paths responsible for building skb. Common error path in xsk_build_skb() assumes that in case errno != -EAGAIN, skb is a valid pointer, which is wrong as kernel test robot reports that in xsk_build_skb_zerocopy() other errno values are returned for skb being NULL. To fix this, set -EOVERFLOW as error when MAX_SKB_FRAGS are exceeded and packet needs to be dropped in both xsk_build_skb() and xsk_build_skb_zerocopy() and use this to distinguish against all other error cases. Also, add explicit kfree_skb() for 3] so that handling of 1], 2], and 3] becomes identical where allocation needs to be retried. Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Closes: https://lore.kernel.org/r/202307210434.OjgqFcbB-lkp@intel.com Link: https://lore.kernel.org/bpf/20230823144713.2231808-1-tirthendu.sarkar@intel.com
# 49e47a5b	02-Aug-2023	Jakub Kicinski <kuba@kernel.org>	net: move struct netdev_rx_queue out of netdevice.h struct netdev_rx_queue is touched in only a few places and having it defined in netdevice.h brings in the dependency on xdp.h, because struct xdp_rxq_info gets embedded in struct netdev_rx_queue. In prep for removal of xdp.h from netdevice.h move all the netdev_rx_queue stuff to a new header. We could technically break the new header up to avoid the sysfs.h include but it's so rarely included it doesn't seem to be worth it at this point. Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
# 24ea5012	19-Jul-2023	Maciej Fijalkowski <maciej.fijalkowski@intel.com>	xsk: support mbuf on ZC RX Given that skb_shared_info relies on skb_frag_t, in order to support xskb chaining, introduce xdp_buff_xsk::xskb_list_node and xsk_buff_pool::xskb_list. This is needed so ZC drivers can add frags as xskb nodes which will make it possible to handle it both when producing AF_XDP Rx descriptors as well as freeing/recycling all the frags that a single frame carries. Speaking of latter, update xsk_buff_free() to take care of list nodes. For the former (adding as frags), introduce xsk_buff_add_frag() for ZC drivers usage that is going to be used to add a frag to xskb list from pool. xsk_buff_get_frag() will be utilized by XDP_TX and, on contrary, will return xdp_buff. One of the previous patches added a wrapper for ZC Rx so implement xskb list walk and production of Rx descriptors there. On bind() path, bail out if socket wants to use ZC multi-buffer but underlying netdev does not support it. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-12-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# cf24f5a5	19-Jul-2023	Tirthendu Sarkar <tirthendu.sarkar@intel.com>	xsk: add support for AF_XDP multi-buffer on Tx path For transmitting an AF_XDP packet, allocate skb while processing the first desc and copy data to it. The 'XDP_PKT_CONTD' flag in 'options' field of the desc indicates the EOP status of the packet. If the current desc is not EOP, store the skb, release the current desc and go on to read the next descs. Allocate a page for each subsequent desc, copy data to it and add it as a frag in the skb stored in xsk. On processing EOP, transmit the skb with frags. Addresses contained in descs have been already queued in consumer queue and skb destructor updated the completion count. On transmit failure cancel the releases, clear the descs from the completion queue and consume the skb for retrying packet transmission. For any invalid descriptor (invalid length/address/options) in the middle of a packet, all pending descriptors will be dropped by xsk core along with the invalid one and the next descriptor is treated as the start of a new packet. Maximum supported frames for a packet is MAX_SKB_FRAGS + 1. If it is exceeded, all descriptors accumulated so far are dropped. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-9-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# b7f72a30	19-Jul-2023	Tirthendu Sarkar <tirthendu.sarkar@intel.com>	xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path In Tx path, xsk core reserves space for each desc to be transmitted in the completion queue and it's address contained in it is stored in the skb destructor arg. After successful transmission the skb destructor submits the addr marking completion. To handle multiple descriptors per packet, now along with reserving space for each descriptor, the corresponding address is also stored in completion queue. The number of pending descriptors are stored in skb destructor arg and is used by the skb destructor to update completions. Introduce 'skb' in xdp_sock to store a partially built packet when __xsk_generic_xmit() must return before it sees the EOP descriptor for the current packet so that packet building can resume in next call of __xsk_generic_xmit(). Helper functions are introduced to set and get the pending descriptors in the skb destructor arg. Also, wrappers are introduced for storing descriptor addresses, submitting and cancelling (for unsuccessful transmissions) the number of completions. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-7-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 80462775	19-Jul-2023	Tirthendu Sarkar <tirthendu.sarkar@intel.com>	xsk: add support for AF_XDP multi-buffer on Rx path Add multi-buffer support for AF_XDP by extending the XDP multi-buffer support to be reflected in user-space when a packet is redirected to an AF_XDP socket. In the XDP implementation, the NIC driver builds the xdp_buff from the first frag of the packet and adds any subsequent frags in the skb_shinfo area of the xdp_buff. In AF_XDP core, XDP buffers are allocated from xdp_sock's pool and data is copied from the driver's xdp_buff and frags. Once an allocated XDP buffer is full and there is still data to be copied, the 'XDP_PKT_CONTD' flag in'options' field of the corresponding xdp ring descriptor is set and passed to the application. When application sees the aforementioned flag set it knows there is pending data for this packet that will be carried in the following descriptors. If there is no more data to be copied, the flag in 'options' field is cleared for that descriptor signalling EOP to the application. If application reads a batch of descriptors using for example the libxdp interfaces, it is not guaranteed that the batch will end with a full packet. It might end in the middle of a packet and the rest of the frames of that packet will arrive at the beginning of the next batch. AF_XDP ensures that only a complete packet (along with all its frags) is sent to application. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-6-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# faa91b83	19-Jul-2023	Tirthendu Sarkar <tirthendu.sarkar@intel.com>	xsk: move xdp_buff's data length check to xsk_rcv_check If the data in xdp_buff exceeds the xsk frame length, the packet needs to be dropped. This check is currently being done in __xsk_rcv(). Move the described logic to xsk_rcv_check() so that such a xdp_buff will only be dropped if the application does not support multi-buffer (absence of XDP_USE_SG bind flag). This is applicable for all cases: copy mode, zero copy mode as well as skb mode. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-5-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 556444c4	19-Jul-2023	Maciej Fijalkowski <maciej.fijalkowski@intel.com>	xsk: prepare both copy and zero-copy modes to co-exist Currently, __xsk_rcv_zc() is a function that is responsible for producing AF_XDP Rx descriptors. It is used by both copy and zero-copy mode. Both of these modes are going to differ when multi-buffer support is going to be added. ZC will work on a chain of xdp_buff_xsk structs whereas copy-mode is going to utilize skb_shared_info contents. This means that ZC-specific changes would affect the copy mode. Let's modify __xsk_rcv_zc() to work directly on xdp_buff_xsk so the callsites have to retrieve this from xdp_buff. Also, introduce xsk_rcv_zc() which will carry all the needed later changes for supporting multi-buffer on ZC side that do not apply to copy mode. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-4-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 81470b5c	19-Jul-2023	Tirthendu Sarkar <tirthendu.sarkar@intel.com>	xsk: introduce XSK_USE_SG bind flag for xsk socket As of now xsk core drops any xdp_buff with data size greater than the xsk frame_size as set by the af_xdp application. With multi-buffer support introduced in the next patch xsk core can now split those buffers into multiple descriptors provided the af_xdp application can handle them. Such capability of the application needs to be independent of the xdp_prog's frag support capability since there are cases where even a single xdp_buffer may need to be split into multiple descriptors owing to a smaller xsk frame size. For e.g., with NIC rx_buffer size set to 4kB, a 3kB packet will constitute of a single buffer and so will be sent as such to AF_XDP layer irrespective of 'xdp.frags' capability of the XDP program. Now if the xsk frame size is set to 2kB by the AF_XDP application, then the packet will need to be split into 2 descriptors if AF_XDP application can handle multi-buffer, else it needs to be dropped. Applications can now advertise their frag handling capability to xsk core so that xsk core can decide if it should drop or split xdp_buffs that exceed xsk frame size. This is done using a new 'XSK_USE_SG' bind flag for the xdp socket. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-3-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 63a64a56	19-Jul-2023	Tirthendu Sarkar <tirthendu.sarkar@intel.com>	xsk: prepare 'options' in xdp_desc for multi-buffer use Use the 'options' field in xdp_desc as a packet continuity marker. Since 'options' field was unused till now and was expected to be set to 0, the 'eop' descriptor will have it set to 0, while the non-eop descriptors will have to set it to 1. This ensures legacy applications continue to work without needing any change for single-buffer packets. Add helper functions and extend xskq_prod_reserve_desc() to use the 'options' field. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-2-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 85c2c79a	09-Aug-2023	Magnus Karlsson <magnus.karlsson@intel.com>	xsk: fix refcount underflow in error path Fix a refcount underflow problem reported by syzbot that can happen when a system is running out of memory. If xp_alloc_tx_descs() fails, and it can only fail due to not having enough memory, then the error path is triggered. In this error path, the refcount of the pool is decremented as it has incremented before. However, the reference to the pool in the socket was not nulled. This means that when the socket is closed later, the socket teardown logic will think that there is a pool attached to the socket and try to decrease the refcount again, leading to a refcount underflow. I chose this fix as it involved adding just a single line. Another option would have been to move xp_get_pool() and the assignment of xs->pool to after the if-statement and using xs_umem->pool instead of xs->pool in the whole if-statement resulting in somewhat simpler code, but this would have led to much more churn in the code base perhaps making it harder to backport. Fixes: ba3beec2ec1d ("xsk: Fix possible crash when multiple sockets are created") Reported-by: syzbot+8ada0057e69293a05fd4@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230809142843.13944-1-magnus.karlsson@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
# 3c5b4d69	28-Jul-2023	Eric Dumazet <edumazet@google.com>	net: annotate data-races around sk->sk_mark sk->sk_mark is often read while another thread could change the value. Fixes: 4a19ec5800fc ("[NET]: Introducing socket mark socket option.") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
# f7306ace	03-Jul-2023	Ilya Maximets <i.maximets@ovn.org>	xsk: Honor SO_BINDTODEVICE on bind Initial creation of an AF_XDP socket requires CAP_NET_RAW capability. A privileged process might create the socket and pass it to a non-privileged process for later use. However, that process will be able to bind the socket to any network interface. Even though it will not be able to receive any traffic without modification of the BPF map, the situation is not ideal. Sockets already have a mechanism that can be used to restrict what interface they can be attached to. That is SO_BINDTODEVICE. To change the SO_BINDTODEVICE binding the process will need CAP_NET_RAW. Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer workflow when non-privileged process is using AF_XDP. The intended workflow is following: 1. First process creates a bare socket with socket(AF_XDP, ...). 2. First process loads the XSK program to the interface. 3. First process adds the socket fd to a BPF map. 4. First process ties socket fd to a particular interface using SO_BINDTODEVICE. 5. First process sends socket fd to a second process. 6. Second process allocates UMEM. 7. Second process binds socket to the interface with bind(...). 8. Second process sends/receives the traffic. All the steps above are possible today if the first process is privileged and the second one has sufficient RLIMIT_MEMLOCK and no capabilities. However, the second process will be able to bind the socket to any interface it wants on step 7 and send traffic from it. With the proposed change, the second process will be able to bind the socket only to a specific interface chosen by the first process at step 4. Fixes: 965a99098443 ("xsk: add support for bind for Rx") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/bpf/20230703175329.3259672-1-i.maximets@ovn.org
# dc97391e	23-Jun-2023	David Howells <dhowells@redhat.com>	sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES) Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-afs@lists.infradead.org cc: mptcp@lists.linux.dev cc: rds-devel@oss.oracle.com cc: tipc-discussion@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
# 5f5a7d8d	24-Mar-2023	Nuno Gonçalves <nunog@fr24.com>	xsk: allow remap of fill and/or completion rings The remap of fill and completion rings was frowned upon as they control the usage of UMEM which does not support concurrent use. At the same time this would disallow the remap of these rings into another process. A possible use case is that the user wants to transfer the socket/ UMEM ownership to another process (via SYS_pidfd_getfd) and so would need to also remap these rings. This will have no impact on current usages and just relaxes the remap limitation. Signed-off-by: Nuno Gonçalves <nunog@fr24.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230324100222.13434-1-nunog@fr24.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# 951bce29	21-Feb-2023	Xuan Zhuo <xuanzhuo@linux.alibaba.com>	xsk: add linux/vmalloc.h to xsk.c Fix the failure of the compilation under the sh4. Because we introduced remap_vmalloc_range() earlier, this has caused the compilation failure on the sh4 platform. So this introduction of the header file of linux/vmalloc.h. config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20230221/202302210041.kpPQLlNQ-lkp@intel.com/config) compiler: sh4-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=9f78bf330a66cd400b3e00f370f597e9fa939207 git remote add net-next https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git git fetch --no-tags net-next master git checkout 9f78bf330a66cd400b3e00f370f597e9fa939207 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sh olddefconfig COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sh SHELL=/bin/bash net/ Fixes: 9f78bf330a66 ("xsk: support use vaddr as ring") Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202302210041.kpPQLlNQ-lkp@intel.com/ Link: https://lore.kernel.org/r/20230221075140.46988-1-xuanzhuo@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
# 9f78bf33	16-Feb-2023	Xuan Zhuo <xuanzhuo@linux.alibaba.com>	xsk: support use vaddr as ring When we try to start AF_XDP on some machines with long running time, due to the machine's memory fragmentation problem, there is no sufficient contiguous physical memory that will cause the start failure. If the size of the queue is 8 * 1024, then the size of the desc[] is 8 * 1024 * 8 = 16 * PAGE, but we also add struct xdp_ring size, so it is 16page+. This is necessary to apply for a 4-order memory. If there are a lot of queues, it is difficult to these machine with long running time. Here, that we actually waste 15 pages. 4-Order memory is 32 pages, but we only use 17 pages. This patch replaces __get_free_pages() by vmalloc() to allocate memory to solve these problems. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
# 1596dae2	15-Feb-2023	Maciej Fijalkowski <maciej.fijalkowski@intel.com>	xsk: check IFF_UP earlier in Tx path Xsk Tx can be triggered via either sendmsg() or poll() syscalls. These two paths share a call to common function xsk_xmit() which has two sanity checks within. A pseudo code example to show the two paths: __xsk_sendmsg() : xsk_poll(): if (unlikely(!xsk_is_bound(xs))) if (unlikely(!xsk_is_bound(xs))) return -ENXIO; return mask; if (unlikely(need_wait)) (...) return -EOPNOTSUPP; xsk_xmit() mark napi id (...) xsk_xmit() xsk_xmit(): if (unlikely(!(xs->dev->flags & IFF_UP))) return -ENETDOWN; if (unlikely(!xs->tx)) return -ENOBUFS; As it can be observed above, in sendmsg() napi id can be marked on interface that was not brought up and this causes a NULL ptr dereference: [31757.505631] BUG: kernel NULL pointer dereference, address: 0000000000000018 [31757.512710] #PF: supervisor read access in kernel mode [31757.517936] #PF: error_code(0x0000) - not-present page [31757.523149] PGD 0 P4D 0 [31757.525726] Oops: 0000 [#1] PREEMPT SMP NOPTI [31757.530154] CPU: 26 PID: 95641 Comm: xdpsock Not tainted 6.2.0-rc5+ #40 [31757.536871] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [31757.547457] RIP: 0010:xsk_sendmsg+0xde/0x180 [31757.551799] Code: 00 75 a2 48 8b 00 a8 04 75 9b 84 d2 74 69 8b 85 14 01 00 00 85 c0 75 1b 48 8b 85 28 03 00 00 48 8b 80 98 00 00 00 48 8b 40 20 <8b> 40 18 89 85 14 01 00 00 8b bd 14 01 00 00 81 ff 00 01 00 00 0f [31757.570840] RSP: 0018:ffffc90034f27dc0 EFLAGS: 00010246 [31757.576143] RAX: 0000000000000000 RBX: ffffc90034f27e18 RCX: 0000000000000000 [31757.583389] RDX: 0000000000000001 RSI: ffffc90034f27e18 RDI: ffff88984cf3c100 [31757.590631] RBP: ffff88984714a800 R08: ffff88984714a800 R09: 0000000000000000 [31757.597877] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000fffffffa [31757.605123] R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000000000 [31757.612364] FS: 00007fb4c5931180(0000) GS:ffff88afdfa00000(0000) knlGS:0000000000000000 [31757.620571] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [31757.626406] CR2: 0000000000000018 CR3: 000000184b41c003 CR4: 00000000007706e0 [31757.633648] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [31757.640894] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [31757.648139] PKRU: 55555554 [31757.650894] Call Trace: [31757.653385] <TASK> [31757.655524] sock_sendmsg+0x8f/0xa0 [31757.659077] ? sockfd_lookup_light+0x12/0x70 [31757.663416] __sys_sendto+0xfc/0x170 [31757.667051] ? do_sched_setscheduler+0xdb/0x1b0 [31757.671658] __x64_sys_sendto+0x20/0x30 [31757.675557] do_syscall_64+0x38/0x90 [31757.679197] entry_SYSCALL_64_after_hwframe+0x72/0xdc [31757.687969] Code: 8e f6 ff 44 8b 4c 24 2c 4c 8b 44 24 20 41 89 c4 44 8b 54 24 28 48 8b 54 24 18 b8 2c 00 00 00 48 8b 74 24 10 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 3a 44 89 e7 48 89 44 24 08 e8 b5 8e f6 ff 48 [31757.707007] RSP: 002b:00007ffd49c73c70 EFLAGS: 00000293 ORIG_RAX: 000000000000002c [31757.714694] RAX: ffffffffffffffda RBX: 000055a996565380 RCX: 00007fb4c5727c16 [31757.721939] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 [31757.729184] RBP: 0000000000000040 R08: 0000000000000000 R09: 0000000000000000 [31757.736429] R10: 0000000000000040 R11: 0000000000000293 R12: 0000000000000000 [31757.743673] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [31757.754940] </TASK> To fix this, let's make xsk_xmit a function that will be responsible for generic Tx, where RCU is handled accordingly and pull out sanity checks and xs->zc handling. Populate sanity checks to __xsk_sendmsg() and xsk_poll(). Fixes: ca2e1a627035 ("xsk: Mark napi_id on sendmsg()") Fixes: 18b1ab7aa76b ("xsk: Fix race at socket teardown") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230215143309.13145-1-maciej.fijalkowski@intel.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
# fe33311c	13-Feb-2023	Jason Xing <kernelxing@tencent.com>	net: no longer support SOCK_REFCNT_DEBUG feature Commit e48c414ee61f ("[INET]: Generalise the TCP sock ID lookup routines") commented out the definition of SOCK_REFCNT_DEBUG in 2005 and later another commit 463c84b97f24 ("[NET]: Introduce inet_connection_sock") removed it. Since we could track all of them through bpf and kprobe related tools and the feature could print loads of information which might not be that helpful even under a little bit pressure, the whole feature which has been inactive for many years is no longer supported. Link: https://lore.kernel.org/lkml/20230211065153.54116-1-kerneljasonxing@gmail.com/ Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
# 60240bc2	21-Sep-2022	Jalal Mostafa <jalal.a.mostapha@gmail.com>	xsk: Inherit need_wakeup flag for shared sockets The flag for need_wakeup is not set for xsks with `XDP_SHARED_UMEM` flag and of different queue ids and/or devices. They should inherit the flag from the first socket buffer pool since no flags can be specified once `XDP_SHARED_UMEM` is specified. Fixes: b5aea28dca134 ("xsk: Add shared umem support between queue ids") Signed-off-by: Jalal Mostafa <jalal.a.mostapha@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220921135701.10199-1-jalal.a.mostapha@gmail.com
# c00c4461	30-Aug-2022	Maciej Fijalkowski <maciej.fijalkowski@intel.com>	xsk: Fix backpressure mechanism on Tx Commit d678cbd2f867 ("xsk: Fix handling of invalid descriptors in XSK TX batching API") fixed batch API usage against set of descriptors with invalid ones but introduced a problem when AF_XDP SW rings are smaller than HW ones. Mismatch of reported Tx'ed frames between HW generator and user space app was observed. It turned out that backpressure mechanism became a bottleneck when the amount of produced descriptors to CQ is lower than what we grabbed from XSK Tx ring. Say that 512 entries had been taken from XSK Tx ring but we had only 490 free entries in CQ. Then callsite (ZC driver) will produce only 490 entries onto HW Tx ring but 512 entries will be released from Tx ring and this is what will be seen by the user space. In order to fix this case, mix XSK Tx/CQ ring interractions by moving around internal functions and changing call order: * pull out xskq_prod_nb_free() from xskq_prod_reserve_addr_batch() up to xsk_tx_peek_release_desc_batch(); ** move xskq_cons_release_n() into xskq_cons_read_desc_batch() After doing so, algorithm can be described as follows: 1. lookup Tx entries 2. use value from 1. to reserve space in CQ () 3. Read from Tx ring as much descriptors as value from 2 3a. release descriptors from XSK Tx ring (*) 4. Finally produce addresses to CQ Fixes: d678cbd2f867 ("xsk: Fix handling of invalid descriptors in XSK TX batching API") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220830121705.8618-1-maciej.fijalkowski@intel.com
# ca2e1a62	07-Jul-2022	Maciej Fijalkowski <maciej.fijalkowski@intel.com>	xsk: Mark napi_id on sendmsg() When application runs in busy poll mode and does not receive a single packet but only sends them, it is currently impossible to get into napi_busy_loop() as napi_id is only marked on Rx side in xsk_rcv_check(). In there, napi_id is being taken from xdp_rxq_info carried by xdp_buff. From Tx perspective, we do not have access to it. What we have handy is the xsk pool. Xsk pool works on a pool of internal xdp_buff wrappers called xdp_buff_xsk. AF_XDP ZC enabled drivers call xp_set_rxq_info() so each of xdp_buff_xsk has a valid pointer to xdp_rxq_info of underlying queue. Therefore, on Tx side, napi_id can be pulled from xs->pool->heads[0].xdp.rxq->napi_id. Hide this pointer chase under helper function, xsk_pool_get_napi_id(). Do this only for sockets working in ZC mode as otherwise rxq pointers would not be initialized. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220707130842.49408-1-maciej.fijalkowski@intel.com
# a6e944f2	14-Jun-2022	Ciara Loftus <ciara.loftus@intel.com>	xsk: Fix generic transmit when completion queue reservation fails Two points of potential failure in the generic transmit function are: 1. completion queue (cq) reservation failure. 2. skb allocation failure Originally the cq reservation was performed first, followed by the skb allocation. Commit 675716400da6 ("xdp: fix possible cq entry leak") reversed the order because at the time there was no mechanism available to undo the cq reservation which could have led to possible cq entry leaks in the event of skb allocation failure. However if the skb allocation is performed first and the cq reservation then fails, the xsk skb destructor is called which blindly adds the skb address to the already full cq leading to undefined behavior. This commit restores the original order (cq reservation followed by skb allocation) and uses the xskq_prod_cancel helper to undo the cq reserve in event of skb allocation failure. Fixes: 675716400da6 ("xdp: fix possible cq entry leak") Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220614070746.8871-1-ciara.loftus@intel.com
# d678cbd2	07-Jun-2022	Maciej Fijalkowski <maciej.fijalkowski@intel.com>	xsk: Fix handling of invalid descriptors in XSK TX batching API xdpxceiver run on a AF_XDP ZC enabled driver revealed a problem with XSK Tx batching API. There is a test that checks how invalid Tx descriptors are handled by AF_XDP. Each valid descriptor is followed by invalid one on Tx side whereas the Rx side expects only to receive a set of valid descriptors. In current xsk_tx_peek_release_desc_batch() function, the amount of available descriptors is hidden inside xskq_cons_peek_desc_batch(). This can be problematic in cases where invalid descriptors are present due to the fact that xskq_cons_peek_desc_batch() returns only a count of valid descriptors. This means that it is impossible to properly update XSK ring state when calling xskq_cons_release_n(). To address this issue, pull out the contents of xskq_cons_peek_desc_batch() so that callers (currently only xsk_tx_peek_release_desc_batch()) will always be able to update the state of ring properly, as total count of entries is now available and use this value as an argument in xskq_cons_release_n(). By doing so, xskq_cons_peek_desc_batch() can be dropped altogether. Fixes: 9349eb3a9d2a ("xsk: Introduce batched Tx descriptor interfaces") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220607142200.576735-1-maciej.fijalkowski@intel.com
# 2be4a677	13-Apr-2022	Maciej Fijalkowski <maciej.fijalkowski@intel.com>	xsk: Diversify return codes in xsk_rcv_check() Inspired by patch that made xdp_do_redirect() return values for XSKMAP more meaningful, return -ENXIO instead of -EINVAL for socket being unbound in xsk_rcv_check() as this is the usual value that is returned for such event. In turn, it is now possible to easily distinguish what went wrong, which is a bit harder when for both cases checked, -EINVAL was returned. Return codes can be counted in a nice way via bpftrace oneliner that Jesper has shown: bpftrace -e 'tracepoint:xdp:xdp_redirect* {@err[-args->err] = count();}' Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220413153015.453864-3-maciej.fijalkowski@intel.com
# c6c1f11b	13-Apr-2022	Björn Töpel <bjorn@kernel.org>	xsk: Improve xdp_do_redirect() error codes The error codes returned by xdp_do_redirect() when redirecting a frame to an AF_XDP socket has not been very useful. A driver could not distinguish between different errors. Prior this change the following codes where used: Socket not bound or incorrect queue/netdev: EINVAL XDP frame/AF_XDP buffer size mismatch: ENOSPC Could not allocate buffer (copy mode): ENOSPC AF_XDP Rx buffer full: ENOSPC After this change: Socket not bound or incorrect queue/netdev: EINVAL XDP frame/AF_XDP buffer size mismatch: ENOSPC Could not allocate buffer (copy mode): ENOMEM AF_XDP Rx buffer full: ENOBUFS An AF_XDP zero-copy driver can now potentially determine if the failure was due to a full Rx buffer, and if so stop processing more frames, yielding to the userland AF_XDP application. Signed-off-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220413153015.453864-2-maciej.fijalkowski@intel.com
# ba3beec2	25-Apr-2022	Maciej Fijalkowski <maciej.fijalkowski@intel.com>	xsk: Fix possible crash when multiple sockets are created Fix a crash that happens if an Rx only socket is created first, then a second socket is created that is Tx only and bound to the same umem as the first socket and also the same netdev and queue_id together with the XDP_SHARED_UMEM flag. In this specific case, the tx_descs array page pool was not created by the first socket as it was an Rx only socket. When the second socket is bound it needs this tx_descs array of this shared page pool as it has a Tx component, but unfortunately it was never allocated, leading to a crash. Note that this array is only used for zero-copy drivers using the batched Tx APIs, currently only ice and i40e. [ 5511.150360] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 5511.158419] #PF: supervisor write access in kernel mode [ 5511.164472] #PF: error_code(0x0002) - not-present page [ 5511.170416] PGD 0 P4D 0 [ 5511.173347] Oops: 0002 [#1] PREEMPT SMP PTI [ 5511.178186] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 5.18.0-rc1+ #97 [ 5511.187245] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016 [ 5511.198418] RIP: 0010:xsk_tx_peek_release_desc_batch+0x198/0x310 [ 5511.205375] Code: c0 83 c6 01 84 c2 74 6d 8d 46 ff 23 07 44 89 e1 48 83 c0 14 48 c1 e1 04 48 c1 e0 04 48 03 47 10 4c 01 c1 48 8b 50 08 48 8b 00 <48> 89 51 08 48 89 01 41 80 bd d7 00 00 00 00 75 82 48 8b 19 49 8b [ 5511.227091] RSP: 0018:ffffc90000003dd0 EFLAGS: 00010246 [ 5511.233135] RAX: 0000000000000000 RBX: ffff88810c8da600 RCX: 0000000000000000 [ 5511.241384] RDX: 000000000000003c RSI: 0000000000000001 RDI: ffff888115f555c0 [ 5511.249634] RBP: ffffc90000003e08 R08: 0000000000000000 R09: ffff889092296b48 [ 5511.257886] R10: 0000ffffffffffff R11: ffff889092296800 R12: 0000000000000000 [ 5511.266138] R13: ffff88810c8db500 R14: 0000000000000040 R15: 0000000000000100 [ 5511.274387] FS: 0000000000000000(0000) GS:ffff88903f800000(0000) knlGS:0000000000000000 [ 5511.283746] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5511.290389] CR2: 0000000000000008 CR3: 00000001046e2001 CR4: 00000000003706f0 [ 5511.298640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5511.306892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5511.315142] Call Trace: [ 5511.317972] <IRQ> [ 5511.320301] ice_xmit_zc+0x68/0x2f0 [ice] [ 5511.324977] ? ktime_get+0x38/0xa0 [ 5511.328913] ice_napi_poll+0x7a/0x6a0 [ice] [ 5511.333784] __napi_poll+0x2c/0x160 [ 5511.337821] net_rx_action+0xdd/0x200 [ 5511.342058] __do_softirq+0xe6/0x2dd [ 5511.346198] irq_exit_rcu+0xb5/0x100 [ 5511.350339] common_interrupt+0xa4/0xc0 [ 5511.354777] </IRQ> [ 5511.357201] <TASK> [ 5511.359625] asm_common_interrupt+0x1e/0x40 [ 5511.364466] RIP: 0010:cpuidle_enter_state+0xd2/0x360 [ 5511.370211] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 e9 00 7b ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 72 02 00 00 31 ff e8 02 0c 80 ff fb 45 85 f6 <0f> 88 11 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49 [ 5511.391921] RSP: 0018:ffffffff82a03e60 EFLAGS: 00000202 [ 5511.397962] RAX: ffff88903f800000 RBX: 0000000000000001 RCX: 000000000000001f [ 5511.406214] RDX: 0000000000000000 RSI: ffffffff823400b9 RDI: ffffffff8234c046 [ 5511.424646] RBP: ffff88810a384800 R08: 000005032a28c046 R09: 0000000000000008 [ 5511.443233] R10: 000000000000000b R11: 0000000000000006 R12: ffffffff82bcf700 [ 5511.461922] R13: 000005032a28c046 R14: 0000000000000001 R15: 0000000000000000 [ 5511.480300] cpuidle_enter+0x29/0x40 [ 5511.494329] do_idle+0x1c7/0x250 [ 5511.507610] cpu_startup_entry+0x19/0x20 [ 5511.521394] start_kernel+0x649/0x66e [ 5511.534626] secondary_startup_64_no_verify+0xc3/0xcb [ 5511.549230] </TASK> Detect such case during bind() and allocate this memory region via newly introduced xp_alloc_tx_descs(). Also, use kvcalloc instead of kcalloc as for other buffer pool allocations, so that it matches the kvfree() from xp_destroy(). Fixes: d1bc532e99be ("i40e: xsk: Move tmp desc array from driver to pool") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220425153745.481322-1-maciej.fijalkowski@intel.com