History log of /linux-master/net/netfilter/nft_set_pipapo_avx2.c
Revision Date Author Comments
# f04df573 13-Feb-2024 Florian Westphal <fw@strlen.de>

netfilter: nft_set_pipapo: constify lookup fn args where possible

Those get called from packet path, content must not be modified.
No functional changes intended.

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>


# 4356e9f8 09-Feb-2024 Linus Torvalds <torvalds@linux-foundation.org>

work around gcc bugs with 'asm goto' with outputs

We've had issues with gcc and 'asm goto' before, and we created a
'asm_volatile_goto()' macro for that in the past: see commits
3f0116c3238a ("compiler/gcc4: Add quirk for 'asm goto' miscompilation
bug") and a9f180345f53 ("compiler/gcc4: Make quirk for
asm_volatile_goto() unconditional").

Then, much later, we ended up removing the workaround in commit
43c249ea0b1e ("compiler-gcc.h: remove ancient workaround for gcc PR
58670") because we no longer supported building the kernel with the
affected gcc versions, but we left the macro uses around.

Now, Sean Christopherson reports a new version of a very similar
problem, which is fixed by re-applying that ancient workaround. But the
problem in question is limited to only the 'asm goto with outputs'
cases, so instead of re-introducing the old workaround as-is, let's
rename and limit the workaround to just that much less common case.

It looks like there are at least two separate issues that all hit in
this area:

(a) some versions of gcc don't mark the asm goto as 'volatile' when it
has outputs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98619
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110420

which is easy to work around by just adding the 'volatile' by hand.

(b) Internal compiler errors:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110422

which are worked around by adding the extra empty 'asm' as a
barrier, as in the original workaround.

but the problem Sean sees may be a third thing since it involves bad
code generation (not an ICE) even with the manually added 'volatile'.

but the same old workaround works for this case, even if this feels a
bit like voodoo programming and may only be hiding the issue.

Reported-and-tested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/20240208220604.140859-1-seanjc@google.com/
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andrew Pinski <quic_apinski@quicinc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5a8cdf6f 08-Feb-2024 Florian Westphal <fw@strlen.de>

netfilter: nft_set_pipapo: remove scratch_aligned pointer

use ->scratch for both avx2 and the generic implementation.

After previous change the scratch->map member is always aligned properly
for AVX2, so we can just use scratch->map in AVX2 too.

The alignoff delta is stored in the scratchpad so we can reconstruct
the correct address to free the area again.

Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 76313d1a 07-Feb-2024 Florian Westphal <fw@strlen.de>

netfilter: nft_set_pipapo: store index in scratch maps

Pipapo needs a scratchpad area to keep state during matching.
This state can be large and thus cannot reside on stack.

Each set preallocates percpu areas for this.

On each match stage, one scratchpad half starts with all-zero and the other
is inited to all-ones.

At the end of each stage, the half that starts with all-ones is
always zero. Before next field is tested, pointers to the two halves
are swapped, i.e. resmap pointer turns into fill pointer and vice versa.

After the last field has been processed, pipapo stashes the
index toggle in a percpu variable, with assumption that next packet
will start with the all-zero half and sets all bits in the other to 1.

This isn't reliable.

There can be multiple sets and we can't be sure that the upper
and lower half of all set scratch map is always in sync (lookups
can be conditional), so one set might have swapped, but other might
not have been queried.

Thus we need to keep the index per-set-and-cpu, just like the
scratchpad.

Note that this bug fix is incomplete, there is a related issue.

avx2 and normal implementation might use slightly different areas of the
map array space due to the avx2 alignment requirements, so
m->scratch (generic/fallback implementation) and ->scratch_aligned
(avx) may partially overlap. scratch and scratch_aligned are not distinct
objects, the latter is just the aligned address of the former.

After this change, write to scratch_align->map_index may write to
scratch->map, so this issue becomes more prominent, we can set to 1
a bit in the supposedly-all-zero area of scratch->map[].

A followup patch will remove the scratch_aligned and makes generic and
avx code use the same (aligned) area.

Its done in a separate change to ease review.

Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2b71e2c7 21-Dec-2021 Colin Ian King <colin.king@intel.com>

netfilter: nft_set_pipapo_avx2: remove redundant pointer lt

The pointer lt is being assigned a value and then later
updated but that value is never read. The pointer is
redundant and can be removed.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b7e945e2 27-Nov-2021 Stefano Brivio <sbrivio@redhat.com>

nft_set_pipapo: Fix bucket load in AVX2 lookup routine for six 8-bit groups

The sixth byte of packet data has to be looked up in the sixth group,
not in the seventh one, even if we load the bucket data into ymm6
(and not ymm5, for convenience of tracking stalls).

Without this fix, matching on a MAC address as first field of a set,
if 8-bit groups are selected (due to a small set size) would fail,
that is, the given MAC address would never match.

Reported-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Cc: <stable@vger.kernel.org> # 5.6.x
Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-By: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 89258f8e 29-May-2021 Florian Westphal <fw@strlen.de>

netfilter: nft_set_pipapo_avx2: fix up description warnings

W=1:
net/netfilter/nft_set_pipapo_avx2.c:159: warning: Excess function parameter 'len' description in 'nft_pipapo_avx2_refill'
net/netfilter/nft_set_pipapo_avx2.c:1124: warning: Function parameter or member 'key' not described in 'nft_pipapo_avx2_lookup'
net/netfilter/nft_set_pipapo_avx2.c:1124: warning: Excess function parameter 'elem' description in 'nft_pipapo_avx2_lookup'

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a58db7ad 09-May-2021 Stefano Brivio <sbrivio@redhat.com>

netfilter: nft_set_pipapo_avx2: Skip LDMXCSR, we don't need a valid MXCSR state

We don't need a valid MXCSR state for the lookup routines, none of
the instructions we use rely on or affect any bit in the MXCSR
register.

Instead of calling kernel_fpu_begin(), we can pass 0 as mask to
kernel_fpu_begin_mask() and spare one LDMXCSR instruction.

Commit 49200d17d27d ("x86/fpu/64: Don't FNINIT in kernel_fpu_begin()")
already speeds up lookups considerably, and by dropping the MCXSR
initialisation we can now get a much smaller, but measurable, increase
in matching rates.

The table below reports matching rates and a wild approximation of
clock cycles needed for a match in a "port,net" test with 10 entries
from selftests/netfilter/nft_concat_range.sh, limited to the first
field, i.e. the port (with nft_set_rbtree initialisation skipped), run
on a single AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB L2$).

The (very rough) estimation of clock cycles is obtained by simply
dividing frequency by matching rate. The "cycles spared" column refers
to the difference in cycles compared to the previous row, and the rate
increase also refers to the previous row. Results are averages of six
runs.

Merely for context, I'm also reporting packet rates obtained by
skipping kernel_fpu_begin() and kernel_fpu_end() altogether (which
shows a very limited impact now), as well as skipping the whole lookup
function, compared to simply counting and dropping all packets using
the netdev hook drop (see nft_concat_range.sh for details). This
workload also includes packet generation with pktgen and the receive
path of veth.

|matching| est. | cycles | rate |
| rate | cycles | spared |increase|
| (Mpps) | | | |
--------------------------------------|--------|--------|--------|--------|
FNINIT, LDMXCSR (before 49200d17d27d) | 5.245 | 553 | - | - |
LDMXCSR only (with 49200d17d27d) | 6.347 | 457 | 96 | 21.0% |
Without LDMXCSR (this patch) | 6.461 | 449 | 8 | 1.8% |
-------- for reference only: ---------|--------|--------|--------|--------|
Without kernel_fpu_begin() | 6.513 | 445 | 4 | 0.8% |
Without actual matching (return true) | 7.649 | 379 | 66 | 17.4% |
Without lookup operation (netdev drop)| 10.320 | 281 | 98 | 34.9% |

The clock cycles spared by avoiding LDMXCSR appear to be in line with CPI
and latency indicated in the manuals of comparable architectures: Intel
Skylake (CPI: 1, latency: 7) and AMD 12h (latency: 12) -- I couldn't find
this information for AMD 17h.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f0b3d338 09-May-2021 Stefano Brivio <sbrivio@redhat.com>

netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to non-AVX2 version

Arturo reported this backtrace:

[709732.358791] WARNING: CPU: 3 PID: 456 at arch/x86/kernel/fpu/core.c:128 kernel_fpu_begin_mask+0xae/0xe0
[709732.358793] Modules linked in: binfmt_misc nft_nat nft_chain_nat nf_nat nft_counter nft_ct nf_tables nf_conntrack_netlink nfnetlink 8021q garp stp mrp llc vrf intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp crc32_pclmul mgag200 ghash_clmulni_intel drm_kms_helper cec aesni_intel drm libaes crypto_simd cryptd glue_helper mei_me dell_smbios iTCO_wdt evdev intel_pmc_bxt iTCO_vendor_support dcdbas pcspkr rapl dell_wmi_descriptor wmi_bmof sg i2c_algo_bit watchdog mei acpi_ipmi ipmi_si button nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipmi_devintf ipmi_msghandler ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor sd_mod t10_pi crc_t10dif crct10dif_generic raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ahci libahci tg3 libata xhci_pci libphy xhci_hcd ptp usbcore crct10dif_pclmul crct10dif_common bnxt_en crc32c_intel scsi_mod
[709732.358941] pps_core i2c_i801 lpc_ich i2c_smbus wmi usb_common
[709732.358957] CPU: 3 PID: 456 Comm: jbd2/dm-0-8 Not tainted 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[709732.358959] Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.9.3 09/23/2020
[709732.358964] RIP: 0010:kernel_fpu_begin_mask+0xae/0xe0
[709732.358969] Code: ae 54 24 04 83 e3 01 75 38 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 33 48 83 c4 10 5b c3 65 8a 05 5e 21 5e 76 84 c0 74 92 <0f> 0b eb 8e f0 80 4f 01 40 48 81 c7 00 14 00 00 e8 dd fb ff ff eb
[709732.358972] RSP: 0018:ffffbb9700304740 EFLAGS: 00010202
[709732.358976] RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000001
[709732.358979] RDX: ffffbb9700304970 RSI: ffff922fe1952e00 RDI: 0000000000000003
[709732.358981] RBP: ffffbb9700304970 R08: ffff922fc868a600 R09: ffff922fc711e462
[709732.358984] R10: 000000000000005f R11: ffff922ff0b27180 R12: ffffbb9700304960
[709732.358987] R13: ffffbb9700304b08 R14: ffff922fc664b6c8 R15: ffff922fc664b660
[709732.358990] FS: 0000000000000000(0000) GS:ffff92371fec0000(0000) knlGS:0000000000000000
[709732.358993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[709732.358996] CR2: 0000557a6655bdd0 CR3: 000000026020a001 CR4: 00000000007706e0
[709732.358999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[709732.359001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[709732.359003] PKRU: 55555554
[709732.359005] Call Trace:
[709732.359009] <IRQ>
[709732.359035] nft_pipapo_avx2_lookup+0x4c/0x1cba [nf_tables]
[709732.359046] ? sched_clock+0x5/0x10
[709732.359054] ? sched_clock_cpu+0xc/0xb0
[709732.359061] ? record_times+0x16/0x80
[709732.359068] ? plist_add+0xc1/0x100
[709732.359073] ? psi_group_change+0x47/0x230
[709732.359079] ? skb_clone+0x4d/0xb0
[709732.359085] ? enqueue_task_rt+0x22b/0x310
[709732.359098] ? bnxt_start_xmit+0x1e8/0xaf0 [bnxt_en]
[709732.359102] ? packet_rcv+0x40/0x4a0
[709732.359121] nft_lookup_eval+0x59/0x160 [nf_tables]
[709732.359133] nft_do_chain+0x350/0x500 [nf_tables]
[709732.359152] ? nft_lookup_eval+0x59/0x160 [nf_tables]
[709732.359163] ? nft_do_chain+0x364/0x500 [nf_tables]
[709732.359172] ? fib4_rule_action+0x6d/0x80
[709732.359178] ? fib_rules_lookup+0x107/0x250
[709732.359184] nft_nat_do_chain+0x8a/0xf2 [nft_chain_nat]
[709732.359193] nf_nat_inet_fn+0xea/0x210 [nf_nat]
[709732.359202] nf_nat_ipv4_out+0x14/0xa0 [nf_nat]
[709732.359207] nf_hook_slow+0x44/0xc0
[709732.359214] ip_output+0xd2/0x100
[709732.359221] ? __ip_finish_output+0x210/0x210
[709732.359226] ip_forward+0x37d/0x4a0
[709732.359232] ? ip4_key_hashfn+0xb0/0xb0
[709732.359238] ip_sublist_rcv_finish+0x4f/0x60
[709732.359243] ip_sublist_rcv+0x196/0x220
[709732.359250] ? ip_rcv_finish_core.isra.22+0x400/0x400
[709732.359255] ip_list_rcv+0x137/0x160
[709732.359264] __netif_receive_skb_list_core+0x29b/0x2c0
[709732.359272] netif_receive_skb_list_internal+0x1a6/0x2d0
[709732.359280] gro_normal_list.part.156+0x19/0x40
[709732.359286] napi_complete_done+0x67/0x170
[709732.359298] bnxt_poll+0x105/0x190 [bnxt_en]
[709732.359304] ? irqentry_exit+0x29/0x30
[709732.359309] ? asm_common_interrupt+0x1e/0x40
[709732.359315] net_rx_action+0x144/0x3c0
[709732.359322] __do_softirq+0xd5/0x29c
[709732.359329] asm_call_irq_on_stack+0xf/0x20
[709732.359332] </IRQ>
[709732.359339] do_softirq_own_stack+0x37/0x40
[709732.359346] irq_exit_rcu+0x9d/0xa0
[709732.359353] common_interrupt+0x78/0x130
[709732.359358] asm_common_interrupt+0x1e/0x40
[709732.359366] RIP: 0010:crc_41+0x0/0x1e [crc32c_intel]
[709732.359370] Code: ff ff f2 4d 0f 38 f1 93 a8 fe ff ff f2 4c 0f 38 f1 81 b0 fe ff ff f2 4c 0f 38 f1 8a b0 fe ff ff f2 4d 0f 38 f1 93 b0 fe ff ff <f2> 4c 0f 38 f1 81 b8 fe ff ff f2 4c 0f 38 f1 8a b8 fe ff ff f2 4d
[709732.359373] RSP: 0018:ffffbb97008dfcd0 EFLAGS: 00000246
[709732.359377] RAX: 000000000000002a RBX: 0000000000000400 RCX: ffff922fc591dd50
[709732.359379] RDX: ffff922fc591dea0 RSI: 0000000000000a14 RDI: ffffffffc00dddc0
[709732.359382] RBP: 0000000000001000 R08: 000000000342d8c3 R09: 0000000000000000
[709732.359384] R10: 0000000000000000 R11: ffff922fc591dff0 R12: ffffbb97008dfe58
[709732.359386] R13: 000000000000000a R14: ffff922fd2b91e80 R15: ffff922fef83fe38
[709732.359395] ? crc_43+0x1e/0x1e [crc32c_intel]
[709732.359403] ? crc32c_pcl_intel_update+0x97/0xb0 [crc32c_intel]
[709732.359419] ? jbd2_journal_commit_transaction+0xaec/0x1a30 [jbd2]
[709732.359425] ? irq_exit_rcu+0x3e/0xa0
[709732.359447] ? kjournald2+0xbd/0x270 [jbd2]
[709732.359454] ? finish_wait+0x80/0x80
[709732.359470] ? commit_timeout+0x10/0x10 [jbd2]
[709732.359476] ? kthread+0x116/0x130
[709732.359481] ? kthread_park+0x80/0x80
[709732.359488] ? ret_from_fork+0x1f/0x30
[709732.359494] ---[ end trace 081a19978e5f09f5 ]---

that is, nft_pipapo_avx2_lookup() uses the FPU running from a softirq
that interrupted a kthread, also using the FPU.

That's exactly the reason why irq_fpu_usable() is there: use it, and
if we can't use the FPU, fall back to the non-AVX2 version of the
lookup operation, i.e. nft_pipapo_lookup().

Reported-by: Arturo Borrero Gonzalez <arturo@netfilter.org>
Cc: <stable@vger.kernel.org> # 5.6.x
Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# eb16933a 07-Mar-2020 Stefano Brivio <sbrivio@redhat.com>

nft_set_pipapo: Prepare for single ranged field usage

A few adjustments in nft_pipapo_init() are needed to allow usage of
this set back-end for a single, ranged field.

Provide a convenient NFT_PIPAPO_MIN_FIELDS definition that currently
makes sure that the rbtree back-end is selected instead, for sets
with a single field.

This finally allows a fair comparison with rbtree sets, by defining
NFT_PIPAPO_MIN_FIELDS as 0 and skipping rbtree back-end initialisation:

---------------.--------------------------.-------------------------.
AMD Epyc 7402 | baselines, Mpps | Mpps, % over rbtree |
1 thread |__________________________|_________________________|
3.35GHz | | | | | |
768KiB L1D$ | netdev | hash | rbtree | | pipapo |
---------------| hook | no | single | pipapo |single field|
type entries | drop | ranges | field |single field| AVX2 |
---------------|--------|--------|--------|------------|------------|
net,port | | | | | |
1000 | 19.0 | 10.4 | 3.8 | 6.0 +58% | 9.6 +153% |
---------------|--------|--------|--------|------------|------------|
port,net | | | | | |
100 | 18.8 | 10.3 | 5.8 | 9.1 +57% |11.6 +100% |
---------------|--------|--------|--------|------------|------------|
net6,port | | | | | |
1000 | 16.4 | 7.6 | 1.8 | 2.8 +55% | 6.5 +261% |
---------------|--------|--------|--------|------------|------------|
port,proto | | | | [1] | [1] |
30000 | 19.6 | 11.6 | 3.9 | 0.9 -77% | 2.7 -31% |
---------------|--------|--------|--------|------------|------------|
port,proto | | | | | |
10000 | 19.6 | 11.6 | 4.4 | 2.1 -52% | 5.6 +27% |
---------------|--------|--------|--------|------------|------------|
port,proto | | | | | |
4 threads 10000| 77.9 | 45.1 | 17.4 | 8.3 -52% |22.4 +29% |
---------------|--------|--------|--------|------------|------------|
net6,port,mac | | | | | |
10 | 16.5 | 5.4 | 4.3 | 4.5 +5% | 8.2 +91% |
---------------|--------|--------|--------|------------|------------|
net6,port,mac, | | | | | |
proto 1000 | 16.5 | 5.7 | 1.9 | 2.8 +47% | 6.6 +247% |
---------------|--------|--------|--------|------------|------------|
net,mac | | | | | |
1000 | 19.0 | 8.4 | 3.9 | 6.0 +54% | 9.9 +154% |
---------------'--------'--------'--------'------------'------------'
[1] Causes switch of lookup table buckets for 'port' to 4-bit groups

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7400b063 07-Mar-2020 Stefano Brivio <sbrivio@redhat.com>

nft_set_pipapo: Introduce AVX2-based lookup implementation

If the AVX2 set is available, we can exploit the repetitive
characteristic of this algorithm to provide a fast, vectorised
version by using 256-bit wide AVX2 operations for bucket loads and
bitwise intersections.

In most cases, this implementation consistently outperforms rbtree
set instances despite the fact they are configured to use a given,
single, ranged data type out of the ones used for performance
measurements by the nft_concat_range.sh kselftest.

That script, injecting packets directly on the ingoing device path
with pktgen, reports, averaged over five runs on a single AMD Epyc
7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below.
CONFIG_RETPOLINE was not set here.

Note that this is not a fair comparison over hash and rbtree set
types: non-ranged entries (used to have a reference for hash types)
would be matched faster than this, and matching on a single field
only (which is the case for rbtree) is also significantly faster.

However, it's not possible at the moment to choose this set type
for non-ranged entries, and the current implementation also needs
a few minor adjustments in order to match on less than two fields.

---------------.-----------------------------------.------------.
AMD Epyc 7402 | baselines, Mpps | this patch |
1 thread |___________________________________|____________|
3.35GHz | | | | | |
768KiB L1D$ | netdev | hash | rbtree | | |
---------------| hook | no | single | | pipapo |
type entries | drop | ranges | field | pipapo | AVX2 |
---------------|--------|--------|--------|--------|------------|
net,port | | | | | |
1000 | 19.0 | 10.4 | 3.8 | 4.0 | 7.5 +87% |
---------------|--------|--------|--------|--------|------------|
port,net | | | | | |
100 | 18.8 | 10.3 | 5.8 | 6.3 | 8.1 +29% |
---------------|--------|--------|--------|--------|------------|
net6,port | | | | | |
1000 | 16.4 | 7.6 | 1.8 | 2.1 | 4.8 +128% |
---------------|--------|--------|--------|--------|------------|
port,proto | | | | | |
30000 | 19.6 | 11.6 | 3.9 | 0.5 | 2.6 +420% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac | | | | | |
10 | 16.5 | 5.4 | 4.3 | 3.4 | 4.7 +38% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, | | | | | |
proto 1000 | 16.5 | 5.7 | 1.9 | 1.4 | 3.6 +26% |
---------------|--------|--------|--------|--------|------------|
net,mac | | | | | |
1000 | 19.0 | 8.4 | 3.9 | 2.5 | 6.4 +156% |
---------------'--------'--------'--------'--------'------------'

A similar strategy could be easily reused to implement specialised
versions for other SIMD sets, and I plan to post at least a NEON
version at a later time.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>