#
e79b47a8 |
|
17-Apr-2024 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: restore set elements when delete set fails From abort path, nft_mapelem_activate() needs to restore refcounters to the original state. Currently, it uses the set->ops->walk() to iterate over these set elements. The existing set iterator skips inactive elements in the next generation, this does not work from the abort path to restore the original state since it has to skip active elements instead (not inactive ones). This patch moves the check for inactive elements to the set iterator callback, then it reverses the logic for the .activate case which needs to skip active elements. Toggle next generation bit for elements when delete set command is invoked and call nft_clear() from .activate (abort) path to restore the next generation bit. The splat below shows an object in mappings memleak: [43929.457523] ------------[ cut here ]------------ [43929.457532] WARNING: CPU: 0 PID: 1139 at include/net/netfilter/nf_tables.h:1237 nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [...] [43929.458014] RIP: 0010:nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [43929.458076] Code: 83 f8 01 77 ab 49 8d 7c 24 08 e8 37 5e d0 de 49 8b 6c 24 08 48 8d 7d 50 e8 e9 5c d0 de 8b 45 50 8d 50 ff 89 55 50 85 c0 75 86 <0f> 0b eb 82 0f 0b eb b3 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 [43929.458081] RSP: 0018:ffff888140f9f4b0 EFLAGS: 00010246 [43929.458086] RAX: 0000000000000000 RBX: ffff8881434f5288 RCX: dffffc0000000000 [43929.458090] RDX: 00000000ffffffff RSI: ffffffffa26d28a7 RDI: ffff88810ecc9550 [43929.458093] RBP: ffff88810ecc9500 R08: 0000000000000001 R09: ffffed10281f3e8f [43929.458096] R10: 0000000000000003 R11: ffff0000ffff0000 R12: ffff8881434f52a0 [43929.458100] R13: ffff888140f9f5f4 R14: ffff888151c7a800 R15: 0000000000000002 [43929.458103] FS: 00007f0c687c4740(0000) GS:ffff888390800000(0000) knlGS:0000000000000000 [43929.458107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [43929.458111] CR2: 00007f58dbe5b008 CR3: 0000000123602005 CR4: 00000000001706f0 [43929.458114] Call Trace: [43929.458118] <TASK> [43929.458121] ? __warn+0x9f/0x1a0 [43929.458127] ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [43929.458188] ? report_bug+0x1b1/0x1e0 [43929.458196] ? handle_bug+0x3c/0x70 [43929.458200] ? exc_invalid_op+0x17/0x40 [43929.458211] ? nft_setelem_data_deactivate+0xd7/0xf0 [nf_tables] [43929.458271] ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [43929.458332] nft_mapelem_deactivate+0x24/0x30 [nf_tables] [43929.458392] nft_rhash_walk+0xdd/0x180 [nf_tables] [43929.458453] ? __pfx_nft_rhash_walk+0x10/0x10 [nf_tables] [43929.458512] ? rb_insert_color+0x2e/0x280 [43929.458520] nft_map_deactivate+0xdc/0x1e0 [nf_tables] [43929.458582] ? __pfx_nft_map_deactivate+0x10/0x10 [nf_tables] [43929.458642] ? __pfx_nft_mapelem_deactivate+0x10/0x10 [nf_tables] [43929.458701] ? __rcu_read_unlock+0x46/0x70 [43929.458709] nft_delset+0xff/0x110 [nf_tables] [43929.458769] nft_flush_table+0x16f/0x460 [nf_tables] [43929.458830] nf_tables_deltable+0x501/0x580 [nf_tables] Fixes: 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
efefd4f0 |
|
17-Apr-2024 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: missing iterator type in lookup walk Add missing decorator type to lookup expression and tighten WARN_ON_ONCE check in pipapo to spot earlier that this is unset. Fixes: 29b359cf6d95 ("netfilter: nft_set_pipapo: walk over current view on netlink dump") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
3cfc9ec0 |
|
10-Apr-2024 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: do not free live element Pablo reports a crash with large batches of elements with a back-to-back add/remove pattern. Quoting Pablo: add_elem("00000000") timeout 100 ms ... add_elem("0000000X") timeout 100 ms del_elem("0000000X") <---------------- delete one that was just added ... add_elem("00005000") timeout 100 ms 1) nft_pipapo_remove() removes element 0000000X Then, KASAN shows a splat. Looking at the remove function there is a chance that we will drop a rule that maps to a non-deactivated element. Removal happens in two steps, first we do a lookup for key k and return the to-be-removed element and mark it as inactive in the next generation. Then, in a second step, the element gets removed from the set/map. The _remove function does not work correctly if we have more than one element that share the same key. This can happen if we insert an element into a set when the set already holds an element with same key, but the element mapping to the existing key has timed out or is not active in the next generation. In such case its possible that removal will unmap the wrong element. If this happens, we will leak the non-deactivated element, it becomes unreachable. The element that got deactivated (and will be freed later) will remain reachable in the set data structure, this can result in a crash when such an element is retrieved during lookup (stale pointer). Add a check that the fully matching key does in fact map to the element that we have marked as inactive in the deactivation step. If not, we need to continue searching. Add a bug/warn trap at the end of the function as well, the remove function must not ever be called with an invisible/unreachable/non-existent element. v2: avoid uneeded temporary variable (Stefano) Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
29b359cf |
|
10-Apr-2024 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nft_set_pipapo: walk over current view on netlink dump The generation mask can be updated while netlink dump is in progress. The pipapo set backend walk iterator cannot rely on it to infer what view of the datastructure is to be used. Add notation to specify if user wants to read/update the set. Based on patch from Florian Westphal. Fixes: 2b84e215f874 ("netfilter: nft_set_pipapo: .walk does not deal with generations") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
b0e256f3 |
|
10-Mar-2024 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nft_set_pipapo: release elements in clone only from destroy path Clone already always provides a current view of the lookup table, use it to destroy the set, otherwise it is possible to destroy elements twice. This fix requires: 212ed75dc5fb ("netfilter: nf_tables: integrate pipapo into commit protocol") which came after: 9827a0e6e23b ("netfilter: nft_set_pipapo: release elements in clone from abort path"). Fixes: 9827a0e6e23b ("netfilter: nft_set_pipapo: release elements in clone from abort path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
5b651783 |
|
15-Feb-2024 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: use GFP_KERNEL for insertions An earlier attempt changed this to GFP_KERNEL, but the get helper is also called for get requests from userspace, which uses rcu. Let the caller pass in the kmalloc flags to allow insertions to schedule if needed. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
9f439bd6 |
|
13-Feb-2024 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: speed up bulk element insertions Insertions into the set are slow when we try to add many elements. For 800k elements I get: time nft -f pipapo_800k real 19m34.849s user 0m2.390s sys 19m12.828s perf stats: --95.39%--nft_pipapo_insert |--76.60%--pipapo_insert | --76.37%--pipapo_resize | |--72.87%--memcpy_orig | |--1.88%--__free_pages_ok | | --0.89%--free_tail_page_prepare | --1.38%--kvmalloc_node .. --18.56%--pipapo_get.isra.0 |--13.91%--__bitmap_and |--3.01%--pipapo_refill |--0.81%--__kmalloc | --0.74%--__kmalloc_large_node | --0.66%--__alloc_pages .. --0.52%--memset_orig So lots of time is spent in copying exising elements to make space for the next one. Instead of allocating to the exact size of the new rule count, allocate extra slack to reduce alloc/copy/free overhead. After: time nft -f pipapo_800k real 1m54.110s user 0m2.515s sys 1m51.377s --80.46%--nft_pipapo_insert |--73.45%--pipapo_get.isra.0 |--57.63%--__bitmap_and | |--8.52%--pipapo_refill |--3.45%--__kmalloc | --3.05%--__kmalloc_large_node | --2.58%--__alloc_pages --2.59%--memset_orig |--6.51%--pipapo_insert --5.96%--pipapo_resize |--3.63%--memcpy_orig --2.13%--kvmalloc_node The new @rules_alloc fills a hole, so struct size doesn't go up. Also make it so rule removal doesn't shrink unless the free/extra space exceeds two pages. This should be safe as well: When a rule gets removed, the attempt to lower the allocated size is already allowed to fail. Exception: do exact allocations as long as set is very small (less than one page needed). v2: address comments from Stefano: kdoc comment formatting changes remove redundant assignment switch back to PAGE_SIZE Link: https://lore.kernel.org/netfilter-devel/20240213141753.17ef27a6@elisabeth/ Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
aac14d51 |
|
13-Feb-2024 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: shrink data structures The set uses a mix of 'int', 'unsigned int', and size_t. The rule count limit is NFT_PIPAPO_RULE0_MAX, which cannot exceed INT_MAX (a few helpers use 'int' as return type). Add a compile-time assertion for this. Replace size_t usage in structs with unsigned int or u8 where the stored values are smaller. Replace signed-int arguments for lengths with 'unsigned int' where possible. Last, remove lt_aligned member: its set but never read. struct nft_pipapo_match 40 bytes -> 32 bytes struct nft_pipapo_field 56 bytes -> 32 bytes Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
07ace0bb |
|
13-Feb-2024 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: do not rely on ZERO_SIZE_PTR pipapo relies on kmalloc(0) returning ZERO_SIZE_PTR (i.e., not NULL but pointer is invalid). Rework this to not call slab allocator when we'd request a 0-byte allocation. Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
f04df573 |
|
13-Feb-2024 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: constify lookup fn args where possible Those get called from packet path, content must not be modified. No functional changes intended. Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
5a8cdf6f |
|
08-Feb-2024 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: remove scratch_aligned pointer use ->scratch for both avx2 and the generic implementation. After previous change the scratch->map member is always aligned properly for AVX2, so we can just use scratch->map in AVX2 too. The alignoff delta is stored in the scratchpad so we can reconstruct the correct address to free the area again. Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
47b1c03c |
|
07-Feb-2024 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: add helper to release pcpu scratch area After next patch simple kfree() is not enough anymore, so add a helper for it. Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
76313d1a |
|
07-Feb-2024 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: store index in scratch maps Pipapo needs a scratchpad area to keep state during matching. This state can be large and thus cannot reside on stack. Each set preallocates percpu areas for this. On each match stage, one scratchpad half starts with all-zero and the other is inited to all-ones. At the end of each stage, the half that starts with all-ones is always zero. Before next field is tested, pointers to the two halves are swapped, i.e. resmap pointer turns into fill pointer and vice versa. After the last field has been processed, pipapo stashes the index toggle in a percpu variable, with assumption that next packet will start with the all-zero half and sets all bits in the other to 1. This isn't reliable. There can be multiple sets and we can't be sure that the upper and lower half of all set scratch map is always in sync (lookups can be conditional), so one set might have swapped, but other might not have been queried. Thus we need to keep the index per-set-and-cpu, just like the scratchpad. Note that this bug fix is incomplete, there is a related issue. avx2 and normal implementation might use slightly different areas of the map array space due to the avx2 alignment requirements, so m->scratch (generic/fallback implementation) and ->scratch_aligned (avx) may partially overlap. scratch and scratch_aligned are not distinct objects, the latter is just the aligned address of the former. After this change, write to scratch_align->map_index may write to scratch->map, so this issue becomes more prominent, we can set to 1 a bit in the supposedly-all-zero area of scratch->map[]. A followup patch will remove the scratch_aligned and makes generic and avx code use the same (aligned) area. Its done in a separate change to ease review. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
7395dfac |
|
05-Feb-2024 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: use timestamp to check for set element timeout Add a timestamp field at the beginning of the transaction, store it in the nftables per-netns area. Update set backend .insert, .deactivate and sync gc path to use the timestamp, this avoids that an element expires while control plane transaction is still unfinished. .lookup and .update, which are used from packet path, still use the current time to check if the element has expired. And .get path and dump also since this runs lockless under rcu read size lock. Then, there is async gc which also needs to check the current time since it runs asynchronously from a workqueue. Fixes: c3e1b005ed1c ("netfilter: nf_tables: add set element timeout support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
ab0beafd |
|
02-Feb-2024 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nft_set_pipapo: remove static in nft_pipapo_get() This has slipped through when reducing memory footprint for set elements, remove it. Fixes: 9dad402b89e8 ("netfilter: nf_tables: expose opaque set element as struct nft_elem_priv") Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
ffb40fba |
|
20-Nov-2023 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: prefer gfp_kernel allocation No need to use GFP_ATOMIC here. Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
317eb968 |
|
01-Dec-2023 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: skip inactive elements during set walk Otherwise set elements can be deactivated twice which will cause a crash. Reported-by: Xingyuan Mo <hdthky0@gmail.com> Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
078996fc |
|
18-Oct-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST Return struct nft_elem_priv instead of struct nft_set_ext for consistency with ("netfilter: nf_tables: expose opaque set element as struct nft_elem_priv") and to prepare the introduction of element timeout updates from control path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
0e1ea651 |
|
16-Oct-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: shrink memory consumption of set elements Instead of copying struct nft_set_elem into struct nft_trans_elem, store the pointer to the opaque set element object in the transaction. Adapt set backend API (and set backend implementations) to take the pointer to opaque set element representation whenever required. This patch deconstifies .remove() and .activate() set backend API since these modify the set element opaque object. And it also constify nft_set_elem_ext() this provides access to the nft_set_ext struct without updating the object. According to pahole on x86_64, this patch shrinks struct nft_trans_elem size from 216 to 24 bytes. This patch also reduces stack memory consumption by removing the template struct nft_set_elem object, using the opaque set element object instead such as from the set iterator API, catchall elements and the get element command. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
9dad402b |
|
18-Oct-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: expose opaque set element as struct nft_elem_priv Add placeholder structure and place it at the beginning of each struct nft_*_elem for each existing set backend, instead of exposing elements as void type to the frontend which defeats compiler type checks. Use this pointer to this new type to replace void *. This patch updates the following set backend API to use this new struct nft_elem_priv placeholder structure: - update - deactivate - flush - get as well as the following helper functions: - nft_set_elem_ext() - nft_set_elem_init() - nft_set_elem_destroy() - nf_tables_set_elem_destroy() This patch adds nft_elem_priv_cast() to cast struct nft_elem_priv to native element representation from the corresponding set backend. BUILD_BUG_ON() makes sure this .priv placeholder is always at the top of the opaque set element representation. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
6509a2e4 |
|
18-Oct-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: set backend .flush always succeeds .flush is always successful since this results from iterating over the set elements to toggle mark the element as inactive in the next generation. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
26cec9d4 |
|
18-Oct-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush Use the element object that is already offered instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
25600167 |
|
13-Oct-2023 |
Florian Westphal <fw@strlen.de> |
netfilter: nf_tables: de-constify set commit ops function argument The set backend using this already has to work around this via ugly cast, don't spread this pattern. Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
6d365eab |
|
06-Sep-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nft_set_pipapo: stop GC iteration if GC transaction allocation fails nft_trans_gc_queue_sync() enqueues the GC transaction and it allocates a new one. If this allocation fails, then stop this GC sync run and retry later. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
4a9e12ea |
|
06-Sep-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nft_set_pipapo: call nft_trans_gc_queue_sync() in catchall GC pipapo needs to enqueue GC transactions for catchall elements through nft_trans_gc_queue_sync(). Add nft_trans_gc_catchall_sync() and nft_trans_gc_catchall_async() to handle GC transaction queueing accordingly. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
5e1be4cd |
|
22-Aug-2023 |
Florian Westphal <fw@strlen.de> |
netfilter: nf_tables: fix out of memory error handling Several instances of pipapo_resize() don't propagate allocation failures, this causes a crash when fault injection is enabled for gfp_kernel slabs. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
|
#
7845914f |
|
12-Aug-2023 |
Florian Westphal <fw@strlen.de> |
netfilter: nf_tables: don't fail inserts if duplicate has expired nftables selftests fail: run-tests.sh testcases/sets/0044interval_overlap_0 Expected: 0-2 . 0-3, got: W: [FAILED] ./testcases/sets/0044interval_overlap_0: got 1 Insertion must ignore duplicate but expired entries. Moreover, there is a strange asymmetry in nft_pipapo_activate: It refetches the current element, whereas the other ->activate callbacks (bitmap, hash, rhash, rbtree) use elem->priv. Same for .remove: other set implementations take elem->priv, nft_pipapo_remove fetches elem->priv, then does a relookup, remove this. I suspect this was the reason for the change that prompted the removal of the expired check in pipapo_get() in the first place, but skipping exired elements there makes no sense to me, this helper is used for normal get requests, insertions (duplicate check) and deactivate callback. In first two cases expired elements must be skipped. For ->deactivate(), this gets called for DELSETELEM, so it seems to me that expired elements should be skipped as well, i.e. delete request should fail with -ENOENT error. Fixes: 24138933b97b ("netfilter: nf_tables: don't skip expired elements during walk") Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
08713cb0 |
|
10-Aug-2023 |
Florian Westphal <fw@strlen.de> |
netfilter: nf_tables: fix kdoc warnings after gc rework Jakub Kicinski says: We've got some new kdoc warnings here: net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc' net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc' include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set' Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
b9f052dc |
|
08-Aug-2023 |
Florian Westphal <fw@strlen.de> |
netfilter: nf_tables: fix false-positive lockdep splat ->abort invocation may cause splat on debug kernels: WARNING: suspicious RCU usage net/netfilter/nft_set_pipapo.c:1697 suspicious rcu_dereference_check() usage! [..] rcu_scheduler_active = 2, debug_locks = 1 1 lock held by nft/133554: [..] (nft_net->commit_mutex){+.+.}-{3:3}, at: nf_tables_valid_genid [..] lockdep_rcu_suspicious+0x1ad/0x260 nft_pipapo_abort+0x145/0x180 __nf_tables_abort+0x5359/0x63d0 nf_tables_abort+0x24/0x40 nfnetlink_rcv+0x1a0a/0x22c0 netlink_unicast+0x73c/0x900 netlink_sendmsg+0x7f0/0xc20 ____sys_sendmsg+0x48d/0x760 Transaction mutex is held, so parallel updates are not possible. Switch to _protected and check mutex is held for lockdep enabled builds. Fixes: 212ed75dc5fb ("netfilter: nf_tables: integrate pipapo into commit protocol") Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
f6c383b8 |
|
09-Aug-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: adapt set backend to use GC transaction API Use the GC transaction API to replace the old and buggy gc API and the busy mark approach. No set elements are removed from async garbage collection anymore, instead the _DEAD bit is set on so the set element is not visible from lookup path anymore. Async GC enqueues transaction work that might be aborted and retried later. rbtree and pipapo set backends does not set on the _DEAD bit from the sync GC path since this runs in control plane path where mutex is held. In this case, set elements are deactivated, removed and then released via RCU callback, sync GC never fails. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support") Fixes: 9d0982927e79 ("netfilter: nft_hash: add support for timeouts") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
24138933 |
|
09-Aug-2023 |
Florian Westphal <fw@strlen.de> |
netfilter: nf_tables: don't skip expired elements during walk There is an asymmetry between commit/abort and preparation phase if the following conditions are met: 1. set is a verdict map ("1.2.3.4 : jump foo") 2. timeouts are enabled In this case, following sequence is problematic: 1. element E in set S refers to chain C 2. userspace requests removal of set S 3. kernel does a set walk to decrement chain->use count for all elements from preparation phase 4. kernel does another set walk to remove elements from the commit phase (or another walk to do a chain->use increment for all elements from abort phase) If E has already expired in 1), it will be ignored during list walk, so its use count won't have been changed. Then, when set is culled, ->destroy callback will zap the element via nf_tables_set_elem_destroy(), but this function is only safe for elements that have been deactivated earlier from the preparation phase: lack of earlier deactivate removes the element but leaks the chain use count, which results in a WARN splat when the chain gets removed later, plus a leak of the nft_chain structure. Update pipapo_get() not to skip expired elements, otherwise flush command reports bogus ENOENT errors. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support") Fixes: 9d0982927e79 ("netfilter: nft_hash: add support for timeouts") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
87b5a5c2 |
|
19-Jul-2023 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: fix improper element removal end key should be equal to start unless NFT_SET_EXT_KEY_END is present. Its possible to add elements that only have a start key ("{ 1.0.0.0 . 2.0.0.0 }") without an internval end. Insertion treats this via: if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END)) end = (const u8 *)nft_set_ext_key_end(ext)->data; else end = start; but removal side always uses nft_set_ext_key_end(). This is wrong and leads to garbage remaining in the set after removal next lookup/insert attempt will give: BUG: KASAN: slab-use-after-free in pipapo_get+0x8eb/0xb90 Read of size 1 at addr ffff888100d50586 by task nft-pipapo_uaf_/1399 Call Trace: kasan_report+0x105/0x140 pipapo_get+0x8eb/0xb90 nft_pipapo_insert+0x1dc/0x1710 nf_tables_newsetelem+0x31f5/0x4e00 .. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: lonial con <kongln9170@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
2b84e215 |
|
16-Jun-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nft_set_pipapo: .walk does not deal with generations The .walk callback iterates over the current active set, but it might be useful to iterate over the next generation set. Use the generation mask to determine what set view (either current or next generation) is use for the walk iteration. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
628bd3e4 |
|
16-Jun-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: drop map element references from preparation phase set .destroy callback releases the references to other objects in maps. This is very late and it results in spurious EBUSY errors. Drop refcount from the preparation phase instead, update set backend not to drop reference counter from set .destroy path. Exceptions: NFT_TRANS_PREPARE_ERROR does not require to drop the reference counter because the transaction abort path releases the map references for each element since the set is unbound. The abort path also deals with releasing reference counter for new elements added to unbound sets. Fixes: 591054469b3e ("netfilter: nf_tables: revisit chain/object refcounting from elements") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
212ed75d |
|
07-Jun-2023 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_tables: integrate pipapo into commit protocol The pipapo set backend follows copy-on-update approach, maintaining one clone of the existing datastructure that is being updated. The clone and current datastructures are swapped via rcu from the commit step. The existing integration with the commit protocol is flawed because there is no operation to clean up the clone if the transaction is aborted. Moreover, the datastructure swap happens on set element activation. This patch adds two new operations for sets: commit and abort, these new operations are invoked from the commit and abort steps, after the transactions have been digested, and it updates the pipapo set backend to use it. This patch adds a new ->pending_update field to sets to maintain a list of sets that require this new commit and abort operations. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
a2a0ffb0 |
|
05-May-2023 |
Christophe JAILLET <christophe.jaillet@wanadoo.fr> |
netfilter: nft_set_pipapo: Use struct_size() Use struct_size() instead of hand writing it. This is less verbose and more informative. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Florian Westphal <fw@strlen.de>
|
#
97d4d394 |
|
24-Nov-2022 |
Stefano Brivio <sbrivio@redhat.com> |
netfilter: nft_set_pipapo: Actually validate intervals in fields after the first one Embarrassingly, nft_pipapo_insert() checked for interval validity in the first field only. The start_p and end_p pointers were reset to key data from the first field at every iteration of the loop which was supposed to go over the set fields. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
9827a0e6 |
|
01-Jul-2022 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nft_set_pipapo: release elements in clone from abort path New elements that reside in the clone are not released in case that the transaction is aborted. [16302.231754] ------------[ cut here ]------------ [16302.231756] WARNING: CPU: 0 PID: 100509 at net/netfilter/nf_tables_api.c:1864 nf_tables_chain_destroy+0x26/0x127 [nf_tables] [...] [16302.231882] CPU: 0 PID: 100509 Comm: nft Tainted: G W 5.19.0-rc3+ #155 [...] [16302.231887] RIP: 0010:nf_tables_chain_destroy+0x26/0x127 [nf_tables] [16302.231899] Code: f3 fe ff ff 41 55 41 54 55 53 48 8b 6f 10 48 89 fb 48 c7 c7 82 96 d9 a0 8b 55 50 48 8b 75 58 e8 de f5 92 e0 83 7d 50 00 74 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8b 65 00 48 8b 7d 08 49 39 fc 74 05 [...] [16302.231917] Call Trace: [16302.231919] <TASK> [16302.231921] __nf_tables_abort.cold+0x23/0x28 [nf_tables] [16302.231934] nf_tables_abort+0x30/0x50 [nf_tables] [16302.231946] nfnetlink_rcv_batch+0x41a/0x840 [nfnetlink] [16302.231952] ? __nla_validate_parse+0x48/0x190 [16302.231959] nfnetlink_rcv+0x110/0x129 [nfnetlink] [16302.231963] netlink_unicast+0x211/0x340 [16302.231969] netlink_sendmsg+0x21e/0x460 Add nft_set_pipapo_match_destroy() helper function to release the elements in the lookup tables. Stefano Brivio says: "We additionally look for elements pointers in the cloned matching data if priv->dirty is set, because that means that cloned data might point to additional elements we did not commit to the working copy yet (such as the abort path case, but perhaps not limited to it)." Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
23c54263 |
|
05-Jan-2022 |
Florian Westphal <fw@strlen.de> |
netfilter: nft_set_pipapo: allocate pcpu scratch maps on clone This is needed in case a new transaction is made that doesn't insert any new elements into an already existing set. Else, after second 'nft -f ruleset.txt', lookups in such a set will fail because ->lookup() encounters raw_cpu_ptr(m->scratch) == NULL. For the initial rule load, insertion of elements takes care of the allocation, but for rule reloads this isn't guaranteed: we might not have additions to the set. Fixes: 3c4287f62044a90e ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: etkaar <lists.netfilter.org@prvy.eu> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
f0b3d338 |
|
09-May-2021 |
Stefano Brivio <sbrivio@redhat.com> |
netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to non-AVX2 version Arturo reported this backtrace: [709732.358791] WARNING: CPU: 3 PID: 456 at arch/x86/kernel/fpu/core.c:128 kernel_fpu_begin_mask+0xae/0xe0 [709732.358793] Modules linked in: binfmt_misc nft_nat nft_chain_nat nf_nat nft_counter nft_ct nf_tables nf_conntrack_netlink nfnetlink 8021q garp stp mrp llc vrf intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp crc32_pclmul mgag200 ghash_clmulni_intel drm_kms_helper cec aesni_intel drm libaes crypto_simd cryptd glue_helper mei_me dell_smbios iTCO_wdt evdev intel_pmc_bxt iTCO_vendor_support dcdbas pcspkr rapl dell_wmi_descriptor wmi_bmof sg i2c_algo_bit watchdog mei acpi_ipmi ipmi_si button nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipmi_devintf ipmi_msghandler ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor sd_mod t10_pi crc_t10dif crct10dif_generic raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ahci libahci tg3 libata xhci_pci libphy xhci_hcd ptp usbcore crct10dif_pclmul crct10dif_common bnxt_en crc32c_intel scsi_mod [709732.358941] pps_core i2c_i801 lpc_ich i2c_smbus wmi usb_common [709732.358957] CPU: 3 PID: 456 Comm: jbd2/dm-0-8 Not tainted 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1 [709732.358959] Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.9.3 09/23/2020 [709732.358964] RIP: 0010:kernel_fpu_begin_mask+0xae/0xe0 [709732.358969] Code: ae 54 24 04 83 e3 01 75 38 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 33 48 83 c4 10 5b c3 65 8a 05 5e 21 5e 76 84 c0 74 92 <0f> 0b eb 8e f0 80 4f 01 40 48 81 c7 00 14 00 00 e8 dd fb ff ff eb [709732.358972] RSP: 0018:ffffbb9700304740 EFLAGS: 00010202 [709732.358976] RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000001 [709732.358979] RDX: ffffbb9700304970 RSI: ffff922fe1952e00 RDI: 0000000000000003 [709732.358981] RBP: ffffbb9700304970 R08: ffff922fc868a600 R09: ffff922fc711e462 [709732.358984] R10: 000000000000005f R11: ffff922ff0b27180 R12: ffffbb9700304960 [709732.358987] R13: ffffbb9700304b08 R14: ffff922fc664b6c8 R15: ffff922fc664b660 [709732.358990] FS: 0000000000000000(0000) GS:ffff92371fec0000(0000) knlGS:0000000000000000 [709732.358993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [709732.358996] CR2: 0000557a6655bdd0 CR3: 000000026020a001 CR4: 00000000007706e0 [709732.358999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [709732.359001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [709732.359003] PKRU: 55555554 [709732.359005] Call Trace: [709732.359009] <IRQ> [709732.359035] nft_pipapo_avx2_lookup+0x4c/0x1cba [nf_tables] [709732.359046] ? sched_clock+0x5/0x10 [709732.359054] ? sched_clock_cpu+0xc/0xb0 [709732.359061] ? record_times+0x16/0x80 [709732.359068] ? plist_add+0xc1/0x100 [709732.359073] ? psi_group_change+0x47/0x230 [709732.359079] ? skb_clone+0x4d/0xb0 [709732.359085] ? enqueue_task_rt+0x22b/0x310 [709732.359098] ? bnxt_start_xmit+0x1e8/0xaf0 [bnxt_en] [709732.359102] ? packet_rcv+0x40/0x4a0 [709732.359121] nft_lookup_eval+0x59/0x160 [nf_tables] [709732.359133] nft_do_chain+0x350/0x500 [nf_tables] [709732.359152] ? nft_lookup_eval+0x59/0x160 [nf_tables] [709732.359163] ? nft_do_chain+0x364/0x500 [nf_tables] [709732.359172] ? fib4_rule_action+0x6d/0x80 [709732.359178] ? fib_rules_lookup+0x107/0x250 [709732.359184] nft_nat_do_chain+0x8a/0xf2 [nft_chain_nat] [709732.359193] nf_nat_inet_fn+0xea/0x210 [nf_nat] [709732.359202] nf_nat_ipv4_out+0x14/0xa0 [nf_nat] [709732.359207] nf_hook_slow+0x44/0xc0 [709732.359214] ip_output+0xd2/0x100 [709732.359221] ? __ip_finish_output+0x210/0x210 [709732.359226] ip_forward+0x37d/0x4a0 [709732.359232] ? ip4_key_hashfn+0xb0/0xb0 [709732.359238] ip_sublist_rcv_finish+0x4f/0x60 [709732.359243] ip_sublist_rcv+0x196/0x220 [709732.359250] ? ip_rcv_finish_core.isra.22+0x400/0x400 [709732.359255] ip_list_rcv+0x137/0x160 [709732.359264] __netif_receive_skb_list_core+0x29b/0x2c0 [709732.359272] netif_receive_skb_list_internal+0x1a6/0x2d0 [709732.359280] gro_normal_list.part.156+0x19/0x40 [709732.359286] napi_complete_done+0x67/0x170 [709732.359298] bnxt_poll+0x105/0x190 [bnxt_en] [709732.359304] ? irqentry_exit+0x29/0x30 [709732.359309] ? asm_common_interrupt+0x1e/0x40 [709732.359315] net_rx_action+0x144/0x3c0 [709732.359322] __do_softirq+0xd5/0x29c [709732.359329] asm_call_irq_on_stack+0xf/0x20 [709732.359332] </IRQ> [709732.359339] do_softirq_own_stack+0x37/0x40 [709732.359346] irq_exit_rcu+0x9d/0xa0 [709732.359353] common_interrupt+0x78/0x130 [709732.359358] asm_common_interrupt+0x1e/0x40 [709732.359366] RIP: 0010:crc_41+0x0/0x1e [crc32c_intel] [709732.359370] Code: ff ff f2 4d 0f 38 f1 93 a8 fe ff ff f2 4c 0f 38 f1 81 b0 fe ff ff f2 4c 0f 38 f1 8a b0 fe ff ff f2 4d 0f 38 f1 93 b0 fe ff ff <f2> 4c 0f 38 f1 81 b8 fe ff ff f2 4c 0f 38 f1 8a b8 fe ff ff f2 4d [709732.359373] RSP: 0018:ffffbb97008dfcd0 EFLAGS: 00000246 [709732.359377] RAX: 000000000000002a RBX: 0000000000000400 RCX: ffff922fc591dd50 [709732.359379] RDX: ffff922fc591dea0 RSI: 0000000000000a14 RDI: ffffffffc00dddc0 [709732.359382] RBP: 0000000000001000 R08: 000000000342d8c3 R09: 0000000000000000 [709732.359384] R10: 0000000000000000 R11: ffff922fc591dff0 R12: ffffbb97008dfe58 [709732.359386] R13: 000000000000000a R14: ffff922fd2b91e80 R15: ffff922fef83fe38 [709732.359395] ? crc_43+0x1e/0x1e [crc32c_intel] [709732.359403] ? crc32c_pcl_intel_update+0x97/0xb0 [crc32c_intel] [709732.359419] ? jbd2_journal_commit_transaction+0xaec/0x1a30 [jbd2] [709732.359425] ? irq_exit_rcu+0x3e/0xa0 [709732.359447] ? kjournald2+0xbd/0x270 [jbd2] [709732.359454] ? finish_wait+0x80/0x80 [709732.359470] ? commit_timeout+0x10/0x10 [jbd2] [709732.359476] ? kthread+0x116/0x130 [709732.359481] ? kthread_park+0x80/0x80 [709732.359488] ? ret_from_fork+0x1f/0x30 [709732.359494] ---[ end trace 081a19978e5f09f5 ]--- that is, nft_pipapo_avx2_lookup() uses the FPU running from a softirq that interrupted a kthread, also using the FPU. That's exactly the reason why irq_fpu_usable() is there: use it, and if we can't use the FPU, fall back to the non-AVX2 version of the lookup operation, i.e. nft_pipapo_lookup(). Reported-by: Arturo Borrero Gonzalez <arturo@netfilter.org> Cc: <stable@vger.kernel.org> # 5.6.x Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
aaa31047 |
|
27-Apr-2021 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nftables: add catch-all set element support This patch extends the set infrastructure to add a special catch-all set element. If the lookup fails to find an element (or range) in the set, then the catch-all element is selected. Users can specify a mapping, expression(s) and timeout to be attached to the catch-all element. This patch adds a catchall list to the set, this list might contain more than one single catch-all element (e.g. in case that the catch-all element is removed and a new one is added in the same transaction). However, most of the time, there will be either one element or no elements at all in this list. The catch-all element is identified via NFT_SET_ELEM_CATCHALL flag and such special element has no NFTA_SET_ELEM_KEY attribute. There is a new nft_set_elem_catchall object that stores a reference to the dummy catch-all element (catchall->elem) whose layout is the same of the set element type to reuse the existing set element codebase. The set size does not apply to the catch-all element, users can define a catch-all element even if the set is full. The check for valid set element flags hava been updates to report EOPNOTSUPP in case userspace requests flags that are not supported when using new userspace nftables and old kernel. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
50935339 |
|
25-Jul-2020 |
Alexander A. Klimov <grandmaster@al2klimov.de> |
netfilter: Replace HTTP links with HTTPS ones Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `\bxmlns\b`: For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`: If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
3db86c39 |
|
12-Jul-2020 |
Andrew Lunn <andrew@lunn.ch> |
net: netfilter: kerneldoc fixes Simple fixes which require no deep knowledge of the code. Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d61d2e90 |
|
14-Jun-2020 |
Stefano Brivio <sbrivio@redhat.com> |
netfilter: nft_set_pipapo: Drop useless assignment of scratch map index on insert In nft_pipapo_insert(), we need to reallocate scratch maps that will be used for matching by lookup functions, if they have never been allocated or if the bucket size changes as a result of the insertion. As pipapo_realloc_scratch() provides a pair of fresh, zeroed out maps, there's no need to select a particular one after reallocation. Other than being useless, the existing assignment was also troubled by the fact that the index was set only on the CPU performing the actual insertion, as spotted by Florian. Simply drop the assignment. Reported-by: Florian Westphal <fw@strlen.de> Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
c3829285 |
|
08-Jun-2020 |
Stefano Brivio <sbrivio@redhat.com> |
netfilter: nft_set_pipapo: Disable preemption before getting per-CPU pointer The lkp kernel test robot reports, with CONFIG_DEBUG_PREEMPT enabled: [ 165.316525] BUG: using smp_processor_id() in preemptible [00000000] code: nft/6247 [ 165.319547] caller is nft_pipapo_insert+0x464/0x610 [nf_tables] [ 165.321846] CPU: 1 PID: 6247 Comm: nft Not tainted 5.6.0-rc5-01595-ge32a4dc6512ce3 #1 [ 165.332128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 165.334892] Call Trace: [ 165.336435] dump_stack+0x8f/0xcb [ 165.338128] debug_smp_processor_id+0xb2/0xc0 [ 165.340117] nft_pipapo_insert+0x464/0x610 [nf_tables] [ 165.342290] ? nft_trans_alloc_gfp+0x1c/0x60 [nf_tables] [ 165.344420] ? rcu_read_lock_sched_held+0x52/0x80 [ 165.346460] ? nft_trans_alloc_gfp+0x1c/0x60 [nf_tables] [ 165.348543] ? __mmu_interval_notifier_insert+0xa0/0xf0 [ 165.350629] nft_add_set_elem+0x5ff/0xa90 [nf_tables] [ 165.352699] ? __lock_acquire+0x241/0x1400 [ 165.354573] ? __lock_acquire+0x241/0x1400 [ 165.356399] ? reacquire_held_locks+0x12f/0x200 [ 165.358384] ? nf_tables_valid_genid+0x1f/0x40 [nf_tables] [ 165.360502] ? nla_strcmp+0x10/0x50 [ 165.362199] ? nft_table_lookup+0x4f/0xa0 [nf_tables] [ 165.364217] ? nla_strcmp+0x10/0x50 [ 165.365891] ? nf_tables_newsetelem+0xd5/0x150 [nf_tables] [ 165.367997] nf_tables_newsetelem+0xd5/0x150 [nf_tables] [ 165.370083] nfnetlink_rcv_batch+0x4fd/0x790 [nfnetlink] [ 165.372205] ? __lock_acquire+0x241/0x1400 [ 165.374058] ? __nla_validate_parse+0x57/0x8a0 [ 165.375989] ? cap_inode_getsecurity+0x230/0x230 [ 165.377954] ? security_capable+0x38/0x50 [ 165.379795] nfnetlink_rcv+0x11d/0x140 [nfnetlink] [ 165.381779] netlink_unicast+0x1b2/0x280 [ 165.383612] netlink_sendmsg+0x351/0x470 [ 165.385439] sock_sendmsg+0x5b/0x60 [ 165.387133] ____sys_sendmsg+0x200/0x280 [ 165.388871] ? copy_msghdr_from_user+0xd9/0x160 [ 165.390805] ___sys_sendmsg+0x88/0xd0 [ 165.392524] ? __might_fault+0x3e/0x90 [ 165.394273] ? sock_getsockopt+0x3d5/0xbb0 [ 165.396021] ? __handle_mm_fault+0x545/0x6a0 [ 165.397822] ? find_held_lock+0x2d/0x90 [ 165.399593] ? __sys_sendmsg+0x5e/0xa0 [ 165.401338] __sys_sendmsg+0x5e/0xa0 [ 165.402979] do_syscall_64+0x60/0x280 [ 165.404680] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 165.406621] RIP: 0033:0x7ff1fa46e783 [ 165.408299] Code: c7 c0 ff ff ff ff eb bb 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 89 54 24 1c 48 [ 165.414163] RSP: 002b:00007ffedf59ea78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 165.416804] RAX: ffffffffffffffda RBX: 00007ffedf59fc60 RCX: 00007ff1fa46e783 [ 165.419419] RDX: 0000000000000000 RSI: 00007ffedf59fb10 RDI: 0000000000000005 [ 165.421886] RBP: 00007ffedf59fc10 R08: 00007ffedf59ea54 R09: 0000000000000001 [ 165.424445] R10: 00007ff1fa630c6c R11: 0000000000000246 R12: 0000000000020000 [ 165.426954] R13: 0000000000000280 R14: 0000000000000005 R15: 00007ffedf59ea90 Disable preemption before accessing the lookup scratch area in nft_pipapo_insert(). Reported-by: kernel test robot <lkp@intel.com> Analysed-by: Florian Westphal <fw@strlen.de> Cc: <stable@vger.kernel.org> # 5.6.x Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
e6abef61 |
|
26-Mar-2020 |
Jason A. Donenfeld <Jason@zx2c4.com> |
x86: update AS_* macros to binutils >=2.23, supporting ADX and AVX2 Now that the kernel specifies binutils 2.23 as the minimum version, we can remove ifdefs for AVX2 and ADX throughout. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
#
0eb4b5ee |
|
21-Mar-2020 |
Stefano Brivio <sbrivio@redhat.com> |
netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion ...and return -ENOTEMPTY to the front-end on collision, -EEXIST if an identical element already exists. Together with the previous patch, element collision will now be returned to the user as -EEXIST. Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
eb16933a |
|
07-Mar-2020 |
Stefano Brivio <sbrivio@redhat.com> |
nft_set_pipapo: Prepare for single ranged field usage A few adjustments in nft_pipapo_init() are needed to allow usage of this set back-end for a single, ranged field. Provide a convenient NFT_PIPAPO_MIN_FIELDS definition that currently makes sure that the rbtree back-end is selected instead, for sets with a single field. This finally allows a fair comparison with rbtree sets, by defining NFT_PIPAPO_MIN_FIELDS as 0 and skipping rbtree back-end initialisation: ---------------.--------------------------.-------------------------. AMD Epyc 7402 | baselines, Mpps | Mpps, % over rbtree | 1 thread |__________________________|_________________________| 3.35GHz | | | | | | 768KiB L1D$ | netdev | hash | rbtree | | pipapo | ---------------| hook | no | single | pipapo |single field| type entries | drop | ranges | field |single field| AVX2 | ---------------|--------|--------|--------|------------|------------| net,port | | | | | | 1000 | 19.0 | 10.4 | 3.8 | 6.0 +58% | 9.6 +153% | ---------------|--------|--------|--------|------------|------------| port,net | | | | | | 100 | 18.8 | 10.3 | 5.8 | 9.1 +57% |11.6 +100% | ---------------|--------|--------|--------|------------|------------| net6,port | | | | | | 1000 | 16.4 | 7.6 | 1.8 | 2.8 +55% | 6.5 +261% | ---------------|--------|--------|--------|------------|------------| port,proto | | | | [1] | [1] | 30000 | 19.6 | 11.6 | 3.9 | 0.9 -77% | 2.7 -31% | ---------------|--------|--------|--------|------------|------------| port,proto | | | | | | 10000 | 19.6 | 11.6 | 4.4 | 2.1 -52% | 5.6 +27% | ---------------|--------|--------|--------|------------|------------| port,proto | | | | | | 4 threads 10000| 77.9 | 45.1 | 17.4 | 8.3 -52% |22.4 +29% | ---------------|--------|--------|--------|------------|------------| net6,port,mac | | | | | | 10 | 16.5 | 5.4 | 4.3 | 4.5 +5% | 8.2 +91% | ---------------|--------|--------|--------|------------|------------| net6,port,mac, | | | | | | proto 1000 | 16.5 | 5.7 | 1.9 | 2.8 +47% | 6.6 +247% | ---------------|--------|--------|--------|------------|------------| net,mac | | | | | | 1000 | 19.0 | 8.4 | 3.9 | 6.0 +54% | 9.9 +154% | ---------------'--------'--------'--------'------------'------------' [1] Causes switch of lookup table buckets for 'port' to 4-bit groups Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
7400b063 |
|
07-Mar-2020 |
Stefano Brivio <sbrivio@redhat.com> |
nft_set_pipapo: Introduce AVX2-based lookup implementation If the AVX2 set is available, we can exploit the repetitive characteristic of this algorithm to provide a fast, vectorised version by using 256-bit wide AVX2 operations for bucket loads and bitwise intersections. In most cases, this implementation consistently outperforms rbtree set instances despite the fact they are configured to use a given, single, ranged data type out of the ones used for performance measurements by the nft_concat_range.sh kselftest. That script, injecting packets directly on the ingoing device path with pktgen, reports, averaged over five runs on a single AMD Epyc 7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below. CONFIG_RETPOLINE was not set here. Note that this is not a fair comparison over hash and rbtree set types: non-ranged entries (used to have a reference for hash types) would be matched faster than this, and matching on a single field only (which is the case for rbtree) is also significantly faster. However, it's not possible at the moment to choose this set type for non-ranged entries, and the current implementation also needs a few minor adjustments in order to match on less than two fields. ---------------.-----------------------------------.------------. AMD Epyc 7402 | baselines, Mpps | this patch | 1 thread |___________________________________|____________| 3.35GHz | | | | | | 768KiB L1D$ | netdev | hash | rbtree | | | ---------------| hook | no | single | | pipapo | type entries | drop | ranges | field | pipapo | AVX2 | ---------------|--------|--------|--------|--------|------------| net,port | | | | | | 1000 | 19.0 | 10.4 | 3.8 | 4.0 | 7.5 +87% | ---------------|--------|--------|--------|--------|------------| port,net | | | | | | 100 | 18.8 | 10.3 | 5.8 | 6.3 | 8.1 +29% | ---------------|--------|--------|--------|--------|------------| net6,port | | | | | | 1000 | 16.4 | 7.6 | 1.8 | 2.1 | 4.8 +128% | ---------------|--------|--------|--------|--------|------------| port,proto | | | | | | 30000 | 19.6 | 11.6 | 3.9 | 0.5 | 2.6 +420% | ---------------|--------|--------|--------|--------|------------| net6,port,mac | | | | | | 10 | 16.5 | 5.4 | 4.3 | 3.4 | 4.7 +38% | ---------------|--------|--------|--------|--------|------------| net6,port,mac, | | | | | | proto 1000 | 16.5 | 5.7 | 1.9 | 1.4 | 3.6 +26% | ---------------|--------|--------|--------|--------|------------| net,mac | | | | | | 1000 | 19.0 | 8.4 | 3.9 | 2.5 | 6.4 +156% | ---------------'--------'--------'--------'--------'------------' A similar strategy could be easily reused to implement specialised versions for other SIMD sets, and I plan to post at least a NEON version at a later time. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
8683f4b9 |
|
07-Mar-2020 |
Stefano Brivio <sbrivio@redhat.com> |
nft_set_pipapo: Prepare for vectorised implementation: helpers Move most macros and helpers to a header file, so that they can be conveniently used by related implementations. No functional changes are intended here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
bf3e5839 |
|
07-Mar-2020 |
Stefano Brivio <sbrivio@redhat.com> |
nft_set_pipapo: Prepare for vectorised implementation: alignment SIMD vector extension sets require stricter alignment than native instruction sets to operate efficiently (AVX, NEON) or for some instructions to work at all (AltiVec). Provide facilities to define arbitrary alignment for lookup tables and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN, lt_aligned and scratch_aligned pointers become available. Additional headroom is allocated, and pointers to the possibly unaligned, originally allocated areas are kept so that they can be freed. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
4051f431 |
|
07-Mar-2020 |
Stefano Brivio <sbrivio@redhat.com> |
nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch While grouping matching bits in groups of four saves memory compared to the more natural choice of 8-bit words (lookup table size is one eighth), it comes at a performance cost, as the number of lookup comparisons is doubled, and those also needs bitshifts and masking. Introduce support for 8-bit lookup groups, together with a mapping mechanism to dynamically switch, based on defined per-table size thresholds and hysteresis, between 8-bit and 4-bit groups, as tables grow and shrink. Empty sets start with 8-bit groups, and per-field tables are converted to 4-bit groups if they get too big. An alternative approach would have been to swap per-set lookup operation functions as needed, but this doesn't allow for different group sizes in the same set, which looks desirable if some fields need significantly more matching data compared to others due to heavier impact of ranges (e.g. a big number of subnets with relatively simple port specifications). Allowing different group sizes for the same lookup functions implies the need for further conditional clauses, whose cost, however, appears to be negligible in tests. The matching rate figures below were obtained for x86_64 running the nft_concat_range.sh "performance" cases, averaged over five runs, on a single thread of an AMD Epyc 7402 CPU, and for aarch64 on a single thread of a BCM2711 (Raspberry Pi 4 Model B 4GB), clocked at a stable 2147MHz frequency: ---------------.-----------------------------------.------------. AMD Epyc 7402 | baselines, Mpps | this patch | 1 thread |___________________________________|____________| 3.35GHz | | | | | | 768KiB L1D$ | netdev | hash | rbtree | | | ---------------| hook | no | single | pipapo | pipapo | type entries | drop | ranges | field | 4 bits | bit switch | ---------------|--------|--------|--------|--------|------------| net,port | | | | | | 1000 | 19.0 | 10.4 | 3.8 | 2.8 | 4.0 +43% | ---------------|--------|--------|--------|--------|------------| port,net | | | | | | 100 | 18.8 | 10.3 | 5.8 | 5.5 | 6.3 +14% | ---------------|--------|--------|--------|--------|------------| net6,port | | | | | | 1000 | 16.4 | 7.6 | 1.8 | 1.3 | 2.1 +61% | ---------------|--------|--------|--------|--------|------------| port,proto | | | | | [1] | 30000 | 19.6 | 11.6 | 3.9 | 0.3 | 0.5 +66% | ---------------|--------|--------|--------|--------|------------| net6,port,mac | | | | | | 10 | 16.5 | 5.4 | 4.3 | 2.6 | 3.4 +31% | ---------------|--------|--------|--------|--------|------------| net6,port,mac, | | | | | | proto 1000 | 16.5 | 5.7 | 1.9 | 1.0 | 1.4 +40% | ---------------|--------|--------|--------|--------|------------| net,mac | | | | | | 1000 | 19.0 | 8.4 | 3.9 | 1.7 | 2.5 +47% | ---------------'--------'--------'--------'--------'------------' [1] Causes switch of lookup table buckets for 'port', not 'proto', to 4-bit groups ---------------.-----------------------------------.------------. BCM2711 | baselines, Mpps | this patch | 1 thread |___________________________________|____________| 2147MHz | | | | | | 32KiB L1D$ | netdev | hash | rbtree | | | ---------------| hook | no | single | pipapo | pipapo | type entries | drop | ranges | field | 4 bits | bit switch | ---------------|--------|--------|--------|--------|------------| net,port | | | | | | 1000 | 1.63 | 1.37 | 0.87 | 0.61 | 0.70 +17% | ---------------|--------|--------|--------|--------|------------| port,net | | | | | | 100 | 1.64 | 1.36 | 1.02 | 0.78 | 0.81 +4% | ---------------|--------|--------|--------|--------|------------| net6,port | | | | | | 1000 | 1.56 | 1.27 | 0.65 | 0.34 | 0.50 +47% | ---------------|--------|--------|--------|--------|------------| port,proto [2] | | | | | | 10000 | 1.68 | 1.43 | 0.84 | 0.30 | 0.40 +13% | ---------------|--------|--------|--------|--------|------------| net6,port,mac | | | | | | 10 | 1.56 | 1.14 | 1.02 | 0.62 | 0.66 +6% | ---------------|--------|--------|--------|--------|------------| net6,port,mac, | | | | | | proto 1000 | 1.56 | 1.12 | 0.64 | 0.27 | 0.40 +48% | ---------------|--------|--------|--------|--------|------------| net,mac | | | | | | 1000 | 1.63 | 1.26 | 0.87 | 0.41 | 0.53 +29% | ---------------'--------'--------'--------'--------'------------' [2] Using 10000 entries instead of 30000 as it would take way too long for the test script to generate all of them Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
e807b13c |
|
07-Mar-2020 |
Stefano Brivio <sbrivio@redhat.com> |
nft_set_pipapo: Generalise group size for buckets Get rid of all hardcoded assumptions that buckets in lookup tables correspond to four-bit groups, and replace them with appropriate calculations based on a variable group size, now stored in struct field. The group size could now be in principle any divisor of eight. Note, though, that lookup and get functions need an implementation intimately depending on the group size, and the only supported size there, currently, is four bits, which is also the initial and only used size at the moment. While at it, drop 'groups' from struct nft_pipapo: it was never used. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
6daf1414 |
|
20-Feb-2020 |
Gustavo A. R. Silva <gustavo@embeddedor.com> |
netfilter: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] Lastly, fix checkpatch.pl warning WARNING: __aligned(size) is preferred over __attribute__((aligned(size))) in net/bridge/netfilter/ebtables.c This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
eb9d7af3 |
|
10-Feb-2020 |
Chen Wandun <chenwandun@huawei.com> |
netfilter: nft_set_pipapo: make the symbol 'nft_pipapo_get' static Fix the following sparse warning: net/netfilter/nft_set_pipapo.c:739:6: warning: symbol 'nft_pipapo_get' was not declared. Should it be static? Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Acked-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
24d19826 |
|
18-Feb-2020 |
Florian Westphal <fw@strlen.de> |
netfilter: nf_tables: make all set structs const They do not need to be writeable anymore. v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano) Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
212d58c1 |
|
20-Feb-2020 |
Stefano Brivio <sbrivio@redhat.com> |
nft_set_pipapo: Actually fetch key data in nft_pipapo_remove() Phil reports that adding elements, flushing and re-adding them right away: nft add table t '{ set s { type ipv4_addr . inet_service; flags interval; }; }' nft add element t s '{ 10.0.0.1 . 22-25, 10.0.0.1 . 10-20 }' nft flush set t s nft add element t s '{ 10.0.0.1 . 10-20, 10.0.0.1 . 22-25 }' triggers, almost reliably, a crash like this one: [ 71.319848] general protection fault, probably for non-canonical address 0x6f6b6e696c2e756e: 0000 [#1] PREEMPT SMP PTI [ 71.321540] CPU: 3 PID: 1201 Comm: kworker/3:2 Not tainted 5.6.0-rc1-00377-g2bb07f4e1d861 #192 [ 71.322746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014 [ 71.324430] Workqueue: events nf_tables_trans_destroy_work [nf_tables] [ 71.325387] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables] [ 71.326164] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b [ 71.328423] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282 [ 71.329225] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0 [ 71.330365] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a [ 71.331473] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000 [ 71.332627] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2 [ 71.333615] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0 [ 71.334596] FS: 0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 [ 71.335780] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 71.336577] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0 [ 71.337533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 71.338557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 71.339718] Call Trace: [ 71.340093] nft_pipapo_destroy+0x7a/0x170 [nf_tables_set] [ 71.340973] nft_set_destroy+0x20/0x50 [nf_tables] [ 71.341879] nf_tables_trans_destroy_work+0x246/0x260 [nf_tables] [ 71.342916] process_one_work+0x1d5/0x3c0 [ 71.343601] worker_thread+0x4a/0x3c0 [ 71.344229] kthread+0xfb/0x130 [ 71.344780] ? process_one_work+0x3c0/0x3c0 [ 71.345477] ? kthread_park+0x90/0x90 [ 71.346129] ret_from_fork+0x35/0x40 [ 71.346748] Modules linked in: nf_tables_set nf_tables nfnetlink 8021q [last unloaded: nfnetlink] [ 71.348153] ---[ end trace 2eaa8149ca759bcc ]--- [ 71.349066] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables] [ 71.350016] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b [ 71.350017] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282 [ 71.350019] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0 [ 71.350019] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a [ 71.350020] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000 [ 71.350021] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2 [ 71.350022] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0 [ 71.350025] FS: 0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 [ 71.350026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 71.350027] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0 [ 71.350028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 71.350028] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 71.350030] Kernel panic - not syncing: Fatal exception [ 71.350412] Kernel Offset: disabled [ 71.365922] ---[ end Kernel panic - not syncing: Fatal exception ]--- which is caused by dangling elements that have been deactivated, but never removed. On a flush operation, nft_pipapo_walk() walks through all the elements in the mapping table, which are then deactivated by nft_flush_set(), one by one, and added to the commit list for removal. Element data is then freed. On transaction commit, nft_pipapo_remove() is called, and failed to remove these elements, leading to the stale references in the mapping. The first symptom of this, revealed by KASan, is a one-byte use-after-free in subsequent calls to nft_pipapo_walk(), which is usually not enough to trigger a panic. When stale elements are used more heavily, though, such as double-free via nft_pipapo_destroy() as in Phil's case, the problem becomes more noticeable. The issue comes from that fact that, on a flush operation, nft_pipapo_remove() won't get the actual key data via elem->key, elements to be deleted upon commit won't be found by the lookup via pipapo_get(), and removal will be skipped. Key data should be fetched via nft_set_ext_key(), instead. Reported-by: Phil Sutter <phil@nwl.cc> Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
9a771204 |
|
14-Feb-2020 |
Stefano Brivio <sbrivio@redhat.com> |
netfilter: nft_set_pipapo: Don't abuse unlikely() in pipapo_refill() I originally used unlikely() in the if (match_only) clause, which we hit on the mapping table for the last field in a set, to ensure we avoid branching to the rest of for loop body, which is executed more frequently. However, Pablo reports, this is confusing as it gives the impression that this is not a common case, and it's actually not the intended usage of unlikely(). I couldn't observe any statistical difference in matching rates on x864_64 and aarch64 without it, so just drop it. Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
bd97ad51 |
|
14-Feb-2020 |
Stefano Brivio <sbrivio@redhat.com> |
netfilter: nft_set_pipapo: Fix mapping table example in comments In both insertion and lookup examples, the two element pointers of rule mapping tables were swapped. Fix that. Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
3c4287f6 |
|
21-Jan-2020 |
Stefano Brivio <sbrivio@redhat.com> |
nf_tables: Add set type for arbitrary concatenation of ranges This new set type allows for intervals in concatenated fields, which are expressed in the usual way, that is, simple byte concatenation with padding to 32 bits for single fields, and given as ranges by specifying start and end elements containing, each, the full concatenation of start and end values for the single fields. Ranges are expanded to composing netmasks, for each field: these are inserted as rules in per-field lookup tables. Bits to be classified are divided in 4-bit groups, and for each group, the lookup table contains 4^2 buckets, representing all the possible values of a bit group. This approach was inspired by the Grouper algorithm: http://www.cse.usf.edu/~ligatti/projects/grouper/ Matching is performed by a sequence of AND operations between bucket values, with buckets selected according to the value of packet bits, for each group. The result of this sequence tells us which rules matched for a given field. In order to concatenate several ranged fields, per-field rules are mapped using mapping arrays, one per field, that specify which rules should be considered while matching the next field. The mapping array for the last field contains a reference to the element originally inserted. The notes in nft_set_pipapo.c cover the algorithm in deeper detail. A pure hash-based approach is of no use here, as ranges need to be classified. An implementation based on "proxying" the existing red-black tree set type, creating a tree for each field, was considered, but deemed impractical due to the fact that elements would need to be shared between trees, at least as long as we want to keep UAPI changes to a minimum. A stand-alone implementation of this algorithm is available at: https://pipapo.lameexcu.se together with notes about possible future optimisations (in pipapo.c). This algorithm was designed with data locality in mind, and can be highly optimised for SIMD instruction sets, as the bulk of the matching work is done with repetitive, simple bitwise operations. At this point, without further optimisations, nft_concat_range.sh reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB L2$): TEST: performance net,port [ OK ] baseline (drop from netdev hook): 10190076pps baseline hash (non-ranged entries): 6179564pps baseline rbtree (match on first field only): 2950341pps set with 1000 full, ranged entries: 2304165pps port,net [ OK ] baseline (drop from netdev hook): 10143615pps baseline hash (non-ranged entries): 6135776pps baseline rbtree (match on first field only): 4311934pps set with 100 full, ranged entries: 4131471pps net6,port [ OK ] baseline (drop from netdev hook): 9730404pps baseline hash (non-ranged entries): 4809557pps baseline rbtree (match on first field only): 1501699pps set with 1000 full, ranged entries: 1092557pps port,proto [ OK ] baseline (drop from netdev hook): 10812426pps baseline hash (non-ranged entries): 6929353pps baseline rbtree (match on first field only): 3027105pps set with 30000 full, ranged entries: 284147pps net6,port,mac [ OK ] baseline (drop from netdev hook): 9660114pps baseline hash (non-ranged entries): 3778877pps baseline rbtree (match on first field only): 3179379pps set with 10 full, ranged entries: 2082880pps net6,port,mac,proto [ OK ] baseline (drop from netdev hook): 9718324pps baseline hash (non-ranged entries): 3799021pps baseline rbtree (match on first field only): 1506689pps set with 1000 full, ranged entries: 783810pps net,mac [ OK ] baseline (drop from netdev hook): 10190029pps baseline hash (non-ranged entries): 5172218pps baseline rbtree (match on first field only): 2946863pps set with 1000 full, ranged entries: 1279122pps v4: - fix build for 32-bit architectures: 64-bit division needs div_u64() (kbuild test robot <lkp@intel.com>) v3: - rework interface for field length specification, NFT_SET_SUBKEY disappears and information is stored in description - remove scratch area to store closing element of ranges, as elements now come with an actual attribute to specify the upper range limit (Pablo Neira Ayuso) - also remove pointer to 'start' element from mapping table, closing key is now accessible via extension data - use bytes right away instead of bits for field lengths, this way we can also double the inner loop of the lookup function to take care of upper and lower bits in a single iteration (minor performance improvement) - make it clearer that set operations are actually atomic API-wise, but we can't e.g. implement flush() as one-shot action - fix type for 'dup' in nft_pipapo_insert(), check for duplicates only in the next generation, and in general take care of differentiating generation mask cases depending on the operation (Pablo Neira Ayuso) - report C implementation matching rate in commit message, so that AVX2 implementation can be compared (Pablo Neira Ayuso) v2: - protect access to scratch maps in nft_pipapo_lookup() with local_bh_disable/enable() (Florian Westphal) - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's already implied (Florian Westphal) - explain why partial allocation failures don't need handling in pipapo_realloc_scratch(), rename 'm' to clone and update related kerneldoc to make it clear we're not operating on the live copy (Florian Westphal) - add expicit check for priv->start_elem in nft_pipapo_insert() to avoid ending up in nft_pipapo_walk() with a NULL start element, and also zero it out in every operation that might make it invalid, so that insertion doesn't proceed with an invalid element (Florian Westphal) Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|