History log of /linux-master/include/net/netfilter/nf_tables.h
Revision Date Author Comments
# 29b359cf 10-Apr-2024 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_set_pipapo: walk over current view on netlink dump

The generation mask can be updated while netlink dump is in progress.
The pipapo set backend walk iterator cannot rely on it to infer what
view of the datastructure is to be used. Add notation to specify if user
wants to read/update the set.

Based on patch from Florian Westphal.

Fixes: 2b84e215f874 ("netfilter: nft_set_pipapo: .walk does not deal with generations")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 31bf508b 21-Dec-2023 Phil Sutter <phil@nwl.cc>

netfilter: nf_tables: Implement table adoption support

Allow a new process to take ownership of a previously owned table,
useful mostly for firewall management services restarting or suspending
when idle.

By extending __NFT_TABLE_F_UPDATE, the on/off/on check in
nf_tables_updtable() also covers table adoption, although it is actually
not needed: Table adoption is irreversible because nf_tables_updtable()
rejects attempts to drop NFT_TABLE_F_OWNER so table->nlpid setting can
happen just once within the transaction.

If the transaction commences, table's nlpid and flags fields are already
set and no further action is required. If it aborts, the table returns
to orphaned state.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>


# 7395dfac 05-Feb-2024 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use timestamp to check for set element timeout

Add a timestamp field at the beginning of the transaction, store it
in the nftables per-netns area.

Update set backend .insert, .deactivate and sync gc path to use the
timestamp, this avoids that an element expires while control plane
transaction is still unfinished.

.lookup and .update, which are used from packet path, still use the
current time to check if the element has expired. And .get path and dump
also since this runs lockless under rcu read size lock. Then, there is
async gc which also needs to check the current time since it runs
asynchronously from a workqueue.

Fixes: c3e1b005ed1c ("netfilter: nf_tables: add set element timeout support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 776d4516 23-Jan-2024 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: restrict tunnel object to NFPROTO_NETDEV

Bail out on using the tunnel dst template from other than netdev family.
Add the infrastructure to check for the family in objects.

Fixes: af308b94a2a4 ("netfilter: nf_tables: add tunnel support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b253d87f 26-Dec-2023 George Guo <guodongtai@kylinos.cn>

netfilter: nf_tables: cleanup documentation

- Correct comments for nlpid, family, udlen and udata in struct nft_table,
and afinfo is no longer a member of enum nft_set_class.

- Add comment for data in struct nft_set_elem.

- Add comment for flags in struct nft_ctx.

- Add comments for timeout in struct nft_set_iter, and flags is not a
member of struct nft_set_iter, remove the comment for it.

- Add comments for commit, abort, estimate and gc_init in struct
nft_set_ops.

- Add comments for pending_update, num_exprs, exprs and catchall_list
in struct nft_set.

- Add comment for ext_len in struct nft_set_ext_tmpl.

- Add comment for inner_ops in struct nft_expr_type.

- Add comments for clone, destroy_clone, reduce, gc, offload,
offload_action, offload_stats in struct nft_expr_ops.

- Add comments for blob_gen_0, blob_gen_1, bound, genmask, udlen, udata,
blob_next in struct nft_chain.

- Add comment for flags in struct nft_base_chain.

- Add comments for udlen, udata in struct nft_object.

- Add comment for type in struct nft_object_ops.

- Add comment for hook_list in struct nft_flowtable, and remove comments
for dev_name and ops which are not members of struct nft_flowtable.

Signed-off-by: George Guo <guodongtai@kylinos.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c301f098 03-Nov-2023 Dan Carpenter <dan.carpenter@linaro.org>

netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval()

The problem is in nft_byteorder_eval() where we are iterating through a
loop and writing to dst[0], dst[1], dst[2] and so on... On each
iteration we are writing 8 bytes. But dst[] is an array of u32 so each
element only has space for 4 bytes. That means that every iteration
overwrites part of the previous element.

I spotted this bug while reviewing commit caf3ef7468f7 ("netfilter:
nf_tables: prevent OOB access in nft_byteorder_eval") which is a related
issue. I think that the reason we have not detected this bug in testing
is that most of time we only write one element.

Fixes: ce1e7989d989 ("netfilter: nft_byteorder: provide 64bit le/be conversion")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 078996fc 18-Oct-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST

Return struct nft_elem_priv instead of struct nft_set_ext for
consistency with ("netfilter: nf_tables: expose opaque set element as
struct nft_elem_priv") and to prepare the introduction of element
timeout updates from control path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0e1ea651 16-Oct-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: shrink memory consumption of set elements

Instead of copying struct nft_set_elem into struct nft_trans_elem, store
the pointer to the opaque set element object in the transaction. Adapt
set backend API (and set backend implementations) to take the pointer to
opaque set element representation whenever required.

This patch deconstifies .remove() and .activate() set backend API since
these modify the set element opaque object. And it also constify
nft_set_elem_ext() this provides access to the nft_set_ext struct
without updating the object.

According to pahole on x86_64, this patch shrinks struct nft_trans_elem
size from 216 to 24 bytes.

This patch also reduces stack memory consumption by removing the
template struct nft_set_elem object, using the opaque set element object
instead such as from the set iterator API, catchall elements and the get
element command.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9dad402b 18-Oct-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: expose opaque set element as struct nft_elem_priv

Add placeholder structure and place it at the beginning of each struct
nft_*_elem for each existing set backend, instead of exposing elements
as void type to the frontend which defeats compiler type checks. Use
this pointer to this new type to replace void *.

This patch updates the following set backend API to use this new struct
nft_elem_priv placeholder structure:

- update
- deactivate
- flush
- get

as well as the following helper functions:

- nft_set_elem_ext()
- nft_set_elem_init()
- nft_set_elem_destroy()
- nf_tables_set_elem_destroy()

This patch adds nft_elem_priv_cast() to cast struct nft_elem_priv to
native element representation from the corresponding set backend.
BUILD_BUG_ON() makes sure this .priv placeholder is always at the top
of the opaque set element representation.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6509a2e4 18-Oct-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: set backend .flush always succeeds

.flush is always successful since this results from iterating over the
set elements to toggle mark the element as inactive in the next
generation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 25600167 13-Oct-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: de-constify set commit ops function argument

The set backend using this already has to work around this via ugly
cast, don't spread this pattern.

Signed-off-by: Florian Westphal <fw@strlen.de>


# 94ecde83 08-Oct-2023 George Guo <guodongtai@kylinos.cn>

netfilter: cleanup struct nft_table

Add comments for nlpid, family, udlen and udata in struct nft_table, and
afinfo is no longer a member of struct nft_table, so remove the comment
for it.

Signed-off-by: George Guo <guodongtai@kylinos.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>


# cf5000a7 19-Sep-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: fix memleak when more than 255 elements expired

When more than 255 elements expired we're supposed to switch to a new gc
container structure.

This never happens: u8 type will wrap before reaching the boundary
and nft_trans_gc_space() always returns true.

This means we recycle the initial gc container structure and
lose track of the elements that came before.

While at it, don't deref 'gc' after we've passed it to call_rcu.

Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>


# 4a9e12ea 06-Sep-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_set_pipapo: call nft_trans_gc_queue_sync() in catchall GC

pipapo needs to enqueue GC transactions for catchall elements through
nft_trans_gc_queue_sync(). Add nft_trans_gc_catchall_sync() and
nft_trans_gc_catchall_async() to handle GC transaction queueing
accordingly.

Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8e51830e 22-Aug-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: defer gc run if previous batch is still pending

Don't queue more gc work, else we may queue the same elements multiple
times.

If an element is flagged as dead, this can mean that either the previous
gc request was invalidated/discarded by a transaction or that the previous
request is still pending in the system work queue.

The latter will happen if the gc interval is set to a very low value,
e.g. 1ms, and system work queue is backlogged.

The sets refcount is 1 if no previous gc requeusts are queued, so add
a helper for this and skip gc run if old requests are pending.

Add a helper for this and skip the gc run in this case.

Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4b80ced9 17-Aug-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: validate all pending tables

We have to validate all tables in the transaction that are in
VALIDATE_DO state, the blamed commit below did not move the break
statement to its right location so we only validate one table.

Moreover, we can't init table->validate to _SKIP when a table object
is allocated.

If we do, then if a transcaction creates a new table and then
fails the transaction, nfnetlink will loop and nft will hang until
user cancels the command.

Add back the pernet state as a place to stash the last state encountered.
This is either _DO (we hit an error during commit validation) or _SKIP
(transaction passed all checks).

Fixes: 00c320f9b755 ("netfilter: nf_tables: make validation state per table")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>


# 08713cb0 10-Aug-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: fix kdoc warnings after gc rework

Jakub Kicinski says:
We've got some new kdoc warnings here:
net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc'
net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc'
include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set'

Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/
Signed-off-by: Florian Westphal <fw@strlen.de>


# a2dd0233 09-Aug-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove busy mark and gc batch API

Ditch it, it has been replace it by the GC transaction API and it has no
clients anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5f68718b 09-Aug-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: GC transaction API to avoid race with control plane

The set types rhashtable and rbtree use a GC worker to reclaim memory.
From system work queue, in periodic intervals, a scan of the table is
done.

The major caveat here is that the nft transaction mutex is not held.
This causes a race between control plane and GC when they attempt to
delete the same element.

We cannot grab the netlink mutex from the work queue, because the
control plane has to wait for the GC work queue in case the set is to be
removed, so we get following deadlock:

cpu 1 cpu2
GC work transaction comes in , lock nft mutex
`acquire nft mutex // BLOCKS
transaction asks to remove the set
set destruction calls cancel_work_sync()

cancel_work_sync will now block forever, because it is waiting for the
mutex the caller already owns.

This patch adds a new API that deals with garbage collection in two
steps:

1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT
so they are not visible via lookup. Annotate current GC sequence in
the GC transaction. Enqueue GC transaction work as soon as it is
full. If ruleset is updated, then GC transaction is aborted and
retried later.

2) GC work grabs the mutex. If GC sequence has changed then this GC
transaction lost race with control plane, abort it as it contains
stale references to objects and let GC try again later. If the
ruleset is intact, then this GC transaction deactivates and removes
the elements and it uses call_rcu() to destroy elements.

Note that no elements are removed from GC lockless path, the _DEAD bit
is set and pointers are collected. GC catchall does not remove the
elements anymore too. There is a new set->dead flag that is set on to
abort the GC transaction to deal with set->ops->destroy() path which
removes the remaining elements in the set from commit_release, where no
mutex is held.

To deal with GC when mutex is held, which allows safe deactivate and
removal, add sync GC API which releases the set element object via
call_rcu(). This is used by rbtree and pipapo backends which also
perform garbage collection from control plane path.

Since element removal from sets can happen from control plane and
element garbage collection/timeout, it is necessary to keep the set
structure alive until all elements have been deactivated and destroyed.

We cannot do a cancel_work_sync or flush_work in nft_set_destroy because
its called with the transaction mutex held, but the aforementioned async
work queue might be blocked on the very mutex that nft_set_destroy()
callchain is sitting on.

This gives us the choice of ABBA deadlock or UaF.

To avoid both, add set->refs refcount_t member. The GC API can then
increment the set refcount and release it once the elements have been
free'd.

Set backends are adapted to use the GC transaction API in a follow up
patch entitled:

("netfilter: nf_tables: use gc transaction API in set backends")

This is joint work with Florian Westphal.

Fixes: cfed7e1b1f8e ("netfilter: nf_tables: add set garbage collection helpers")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1689f259 28-Jun-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: report use refcount overflow

Overflow use refcount checks are not complete.

Add helper function to deal with object reference counter tracking.
Report -EMFILE in case UINT_MAX is reached.

nft_use_dec() splats in case that reference counter underflows,
which should not ever happen.

Add nft_use_inc_restore() and nft_use_dec_restore() which are used
to restore reference counter from error and abort paths.

Use u32 in nft_flowtable and nft_object since helper functions cannot
work on bitfields.

Remove the few early incomplete checks now that the helper functions
are in place and used to check for refcount overflow.

Fixes: 96518518cc41 ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 96b2ef9b 06-Jun-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: permit update of set size

Now that set->nelems is always updated permit update of the sets max size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 938154b9 16-Jun-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: reject unbound anonymous set before commit phase

Add a new list to track set transaction and to check for unbound
anonymous sets before entering the commit phase.

Bail out at the end of the transaction handling if an anonymous set
remains unbound.

Fixes: 96518518cc41 ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 628bd3e4 16-Jun-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: drop map element references from preparation phase

set .destroy callback releases the references to other objects in maps.
This is very late and it results in spurious EBUSY errors. Drop refcount
from the preparation phase instead, update set backend not to drop
reference counter from set .destroy path.

Exceptions: NFT_TRANS_PREPARE_ERROR does not require to drop the
reference counter because the transaction abort path releases the map
references for each element since the set is unbound. The abort path
also deals with releasing reference counter for new elements added to
unbound sets.

Fixes: 591054469b3e ("netfilter: nf_tables: revisit chain/object refcounting from elements")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 26b5a571 16-Jun-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain

Add a new state to deal with rule expressions deactivation from the
newrule error path, otherwise the anonymous set remains in the list in
inactive state for the next generation. Mark the set/chain transaction
as unbound so the abort path releases this object, set it as inactive in
the next generation so it is not reachable anymore from this transaction
and reference counter is dropped.

Fixes: 1240eb93f061 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4bedf9ee 16-Jun-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: fix chain binding transaction logic

Add bound flag to rule and chain transactions as in 6a0a8d10a366
("netfilter: nf_tables: use-after-free in failing rule with bound set")
to skip them in case that the chain is already bound from the abort
path.

This patch fixes an imbalance in the chain use refcnt that triggers a
WARN_ON on the table and chain destroy path.

This patch also disallows nested chain bindings, which is not
supported from userspace.

The logic to deal with chain binding in nft_data_hold() and
nft_data_release() is not correct. The NFT_TRANS_PREPARE state needs a
special handling in case a chain is bound but next expressions in the
same rule fail to initialize as described by 1240eb93f061 ("netfilter:
nf_tables: incorrect error path handling with NFT_MSG_NEWRULE").

The chain is left bound if rule construction fails, so the objects
stored in this chain (and the chain itself) are released by the
transaction records from the abort path, follow up patch ("netfilter:
nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain")
completes this error handling.

When deleting an existing rule, chain bound flag is set off so the
rule expression .destroy path releases the objects.

Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 212ed75d 07-Jun-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: integrate pipapo into commit protocol

The pipapo set backend follows copy-on-update approach, maintaining one
clone of the existing datastructure that is being updated. The clone
and current datastructures are swapped via rcu from the commit step.

The existing integration with the commit protocol is flawed because
there is no operation to clean up the clone if the transaction is
aborted. Moreover, the datastructure swap happens on set element
activation.

This patch adds two new operations for sets: commit and abort, these new
operations are invoked from the commit and abort steps, after the
transactions have been digested, and it updates the pipapo set backend
to use it.

This patch adds a new ->pending_update field to sets to maintain a list
of sets that require this new commit and abort operations.

Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c1592a89 02-May-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: deactivate anonymous set from preparation phase

Toggle deleted anonymous sets as inactive in the next generation, so
users cannot perform any update on it. Clear the generation bitmask
in case the transaction is aborted.

The following KASAN splat shows a set element deletion for a bound
anonymous set that has been already removed in the same transaction.

[ 64.921510] ==================================================================
[ 64.923123] BUG: KASAN: wild-memory-access in nf_tables_commit+0xa24/0x1490 [nf_tables]
[ 64.924745] Write of size 8 at addr dead000000000122 by task test/890
[ 64.927903] CPU: 3 PID: 890 Comm: test Not tainted 6.3.0+ #253
[ 64.931120] Call Trace:
[ 64.932699] <TASK>
[ 64.934292] dump_stack_lvl+0x33/0x50
[ 64.935908] ? nf_tables_commit+0xa24/0x1490 [nf_tables]
[ 64.937551] kasan_report+0xda/0x120
[ 64.939186] ? nf_tables_commit+0xa24/0x1490 [nf_tables]
[ 64.940814] nf_tables_commit+0xa24/0x1490 [nf_tables]
[ 64.942452] ? __kasan_slab_alloc+0x2d/0x60
[ 64.944070] ? nf_tables_setelem_notify+0x190/0x190 [nf_tables]
[ 64.945710] ? kasan_set_track+0x21/0x30
[ 64.947323] nfnetlink_rcv_batch+0x709/0xd90 [nfnetlink]
[ 64.948898] ? nfnetlink_rcv_msg+0x480/0x480 [nfnetlink]

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b9703ed4 20-Apr-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: support for adding new devices to an existing netdev chain

This patch allows users to add devices to an existing netdev chain.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 46df4175 14-Apr-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: do not store rule in traceinfo structure

pass it as argument instead. This reduces size of traceinfo to
16 bytes. Total stack usage:

nf_tables_core.c:252 nft_do_chain 304 static

While its possible to also pass basechain as argument, doing so
increases nft_do_chaininfo function size.

Unlike pktinfo/verdict/rule the basechain info isn't used in
the expression evaluation path. gcc places it on the stack, which
results in extra push/pop when it gets passed to the trace helpers
as argument rather than as part of the traceinfo structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0a202145 14-Apr-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: do not store verdict in traceinfo structure

Just pass it as argument to nft_trace_notify. Stack is reduced by 8 bytes:

nf_tables_core.c:256 nft_do_chain 312 static

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 698bb828 14-Apr-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: do not store pktinfo in traceinfo structure

pass it as argument. No change in object size.

stack usage decreases by 8 byte:
nf_tables_core.c:254 nft_do_chain 320 static

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 00c320f9 13-Apr-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: make validation state per table

We only need to validate tables that saw changes in the current
transaction.

The existing code revalidates all tables, but this isn't needed as
cross-table jumps are not allowed (chains have table scope).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 63e9bbbc 11-Apr-2023 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: don't store chain address on jump

Now that the rule trailer/end marker and the rcu head reside in the
same structure, we no longer need to save/restore the chain pointer
when performing/returning from a jump.

We can simply let the trace infra walk the evaluated rule until it
hits the end marker and then fetch the chain pointer from there.

When the rule is NULL (policy tracing), then chain and basechain
pointers were already identical, so just use the basechain.

This cuts size of jumpstack in half, from 256 to 128 bytes in 64bit,
scripts/stackusage says:

nf_tables_core.c:251 nft_do_chain 328 static

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d46fc894 16-Apr-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: validate catch-all set elements

catch-all set element might jump/goto to chain that uses expressions
that require validation.

Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 123b9961 19-Dec-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: honor set timeout and garbage collection updates

Set timeout and garbage collection interval updates are ignored on
updates. Add transaction to update global set element timeout and
garbage collection interval.

Fixes: 96518518cc41 ("netfilter: add nftables")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# bed4a63e 19-Dec-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: consolidate set description

Add the following fields to the set description:

- key type
- data type
- object type
- policy
- gc_int: garbage collection interval)
- timeout: element timeout

This prepares for stricter set type checks on updates in a follow up
patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8daa8fde 14-Oct-2022 Phil Sutter <phil@nwl.cc>

netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET

Analogous to NFT_MSG_GETOBJ_RESET, but for rules: Reset stateful
expressions like counters or quotas. The latter two are the only
consumers, adjust their 'dump' callbacks to respect the parameter
introduced earlier.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7d34aa3e 14-Oct-2022 Phil Sutter <phil@nwl.cc>

netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters

Add a 'reset' flag just like with nft_object_ops::dump. This will be
useful to reset "anonymous stateful objects", e.g. simple rule counters.

No functional change intended.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0e795b37 17-Oct-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_inner: add percpu inner context

Add NFT_PKTINFO_INNER_FULL flag to annotate that inner offsets are
available. Store nft_inner_tun_ctx object in percpu area to cache
existing inner offsets for this skbuff.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3a07327d 25-Oct-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_inner: support for inner tunnel header matching

This new expression allows you to match on the inner headers that are
encapsulated by any of the existing tunneling protocols.

This expression parses the inner packet to set the link, network and
transport offsets, so the existing expressions (with a few updates) can
be reused to match on the inner headers.

The inner expression supports for different tunnel combinations such as:

- ethernet frame over IPv4/IPv6 packet, eg. VxLAN.
- IPv4/IPv6 packet over IPv4/IPv6 packet, eg. IPIP.
- IPv4/IPv6 packet over IPv4/IPv6 + transport header, eg. GRE.
- transport header (ESP or SCTP) over transport header (usually UDP)

The following fields are used to describe the tunnel protocol:

- flags, which describe how to parse the inner headers:

NFT_PAYLOAD_CTX_INNER_TUN, the tunnel provides its own header.
NFT_PAYLOAD_CTX_INNER_ETHER, the ethernet frame is available as inner header.
NFT_PAYLOAD_CTX_INNER_NH, the network header is available as inner header.
NFT_PAYLOAD_CTX_INNER_TH, the transport header is available as inner header.

For example, VxLAN sets on all of these flags. While GRE only sets on
NFT_PAYLOAD_CTX_INNER_NH and NFT_PAYLOAD_CTX_INNER_TH. Then, ESP over
UDP only sets on NFT_PAYLOAD_CTX_INNER_TH.

The tunnel description is composed of the following attributes:

- header size: in case the tunnel comes with its own header, eg. VxLAN.

- type: this provides a hint to userspace on how to delinearize the rule.
This is useful for VxLAN and Geneve since they run over UDP, since
transport does not provide a hint. This is also useful in case hardware
offload is ever supported. The type is not currently interpreted by the
kernel.

- expression: currently only payload supported. Follow up patch adds
also inner meta support which is required by autogenerated
dependencies. The exthdr expression should be supported too
at some point. There is a new inner_ops operation that needs to be
set on to allow to use an existing expression from the inner expression.

This patch adds a new NFT_PAYLOAD_TUN_HEADER base which allows to match
on the tunnel header fields, eg. vxlan vni.

The payload expression is embedded into nft_inner private area and this
private data area is passed to the payload inner eval function via
direct call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e7a1caa6 14-Oct-2022 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: reduce nft_pktinfo by 8 bytes

structure is reduced from 32 to 24 bytes. While at it, also check
that iphdrlen is sane, this is guaranteed for NFPROTO_IPV4 but not
for ingress or bridge, so add checks for this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ab482c6b 21-Aug-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: make table handle allocation per-netns friendly

mutex is per-netns, move table_netns to the pernet area.

*read-write* to 0xffffffff883a01e8 of 8 bytes by task 6542 on cpu 0:
nf_tables_newtable+0x6dc/0xc00 net/netfilter/nf_tables_api.c:1221
nfnetlink_rcv_batch net/netfilter/nfnetlink.c:513 [inline]
nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:634 [inline]
nfnetlink_rcv+0xa6a/0x13a0 net/netfilter/nfnetlink.c:652
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x652/0x730 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x643/0x740 net/netlink/af_netlink.c:1921

Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Reported-by: Abhishek Shah <abhishek.shah@columbia.edu>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f323ef3a 08-Aug-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: disallow jump to implicit chain from set element

Extend struct nft_data_desc to add a flag field that specifies
nft_data_init() is being called for set element data.

Use it to disallow jump to implicit chain from set element, only jump
to chain via immediate expression is allowed.

Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 341b6941 08-Aug-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: upfront validation of data via nft_data_init()

Instead of parsing the data and then validate that type and length are
correct, pass a description of the expected data so it can be validated
upfront before parsing it to bail out earlier.

This patch adds a new .size field to specify the maximum size of the
data area. The .len field is optional and it is used as an input/output
field, it provides the specific length of the expected data in the input
path. If then .len field is not specified, then obtained length from the
netlink attribute is stored. This is required by cmp, bitwise, range and
immediate, which provide no netlink attribute that describes the data
length. The immediate expression uses the destination register type to
infer the expected data type.

Relying on opencoded validation of the expected data might lead to
subtle bugs as described in 7e6bc1f6cabc ("netfilter: nf_tables:
stricter validation of element data").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 34aae2c2 09-Aug-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: validate variable length element extension

Update template to validate variable length extensions. This patch adds
a new .ext_len[id] field to the template to store the expected extension
length. This is used to sanity check the initialization of the variable
length extension.

Use PTR_ERR() in nft_set_elem_init() to report errors since, after this
update, there are two reason why this might fail, either because of
ENOMEM or insufficient room in the extension field (EINVAL).

Kernels up until 7e6bc1f6cabc ("netfilter: nf_tables: stricter
validation of element data") allowed to copy more data to the extension
than was allocated. This ext_len field allows to validate if the
destination has the correct size as additional check.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7278b3c1 23-Jun-2022 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: add and use BE register load-store helpers

Same as the existing ones, no conversions. This is just for sparse sake
only so that we no longer mix be16/u16 and be32/u32 types.

Alternative is to add __force __beX in various places, but this
seems nicer.

objdiff shows no changes.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c39ba4de 05-Jul-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: replace BUG_ON by element length check

BUG_ON can be triggered from userspace with an element with a large
userdata area. Replace it by length check and return EINVAL instead.
Over time extensions have been growing in size.

Pick a sufficiently old Fixes: tag to propagate this fix.

Fixes: 7d7402642eaf ("netfilter: nf_tables: variable sized set element keys / data")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e34b9ed9 22-Jun-2022 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: avoid skb access on nf_stolen

When verdict is NF_STOLEN, the skb might have been freed.

When tracing is enabled, this can result in a use-after-free:
1. access to skb->nf_trace
2. access to skb->mark
3. computation of trace id
4. dump of packet payload

To avoid 1, keep a cached copy of skb->nf_trace in the
trace state struct.
Refresh this copy whenever verdict is != STOLEN.

Avoid 2 by skipping skb->mark access if verdict is STOLEN.

3 is avoided by precomputing the trace id.

Only dump the packet when verdict is not "STOLEN".

Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b6d9014a 30-May-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: delete flowtable hooks via transaction list

Remove inactive bool field in nft_hook object that was introduced in
abadb2f865d7 ("netfilter: nf_tables: delete devices from flowtable").
Move stale flowtable hooks to transaction list instead.

Deleting twice the same device does not result in ENOENT.

Fixes: abadb2f865d7 ("netfilter: nf_tables: delete devices from flowtable")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 34cc9e52 14-Mar-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: cancel tracking for clobbered destination registers

Output of expressions might be larger than one single register, this might
clobber existing data. Reset tracking for all destination registers that
required to store the expression output.

This patch adds three new helper functions:

- nft_reg_track_update: cancel previous register tracking and update it.
- nft_reg_track_cancel: cancel any previous register tracking info.
- __nft_reg_track_cancel: cancel only one single register tracking info.

Partial register clobbering detection is also supported by checking the
.num_reg field which describes the number of register that are used.

This patch updates the following expressions:

- meta_bridge
- bitwise
- byteorder
- meta
- payload

to use these helper functions.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b2d30654 14-Mar-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: do not reduce read-only expressions

Skip register tracking for expressions that perform read-only operations
on the registers. Define and use a cookie pointer NFT_REDUCE_READONLY to
avoid defining stubs for these expressions.

This patch re-enables register tracking which was disabled in ed5f85d42290
("netfilter: nf_tables: disable register tracking"). Follow up patches
add remaining register tracking for existing expressions.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b1a5983f 17-Feb-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables_offload: incorrect flow offload action array size

immediate verdict expression needs to allocate one slot in the flow offload
action array, however, immediate data expression does not need to do so.

fwd and dup expression need to allocate one slot, this is missing.

Add a new offload_action interface to report if this expression needs to
allocate one slot in the flow offload action array.

Fixes: be2861dc36d7 ("netfilter: nft_{fwd,dup}_netdev: add offload support")
Reported-and-tested-by: Nick Gregory <Nick.Gregory@Sophos.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# be5650f8 09-Jan-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_bitwise: track register operations

Check if the destination register already contains the data that this
bitwise expression performs. This allows to skip this redundant
operation.

If the destination contains a different bitwise operation, cancel the
register tracking information. If the destination contains no bitwise
operation, update the register tracking information.

Update the payload and meta expression to check if this bitwise
operation has been already performed on the register. Hence, both the
payload/meta and the bitwise expressions are reduced.

There is also a special case: If source register != destination register
and source register is not updated by a previous bitwise operation, then
transfer selector from the source register to the destination register.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 12e4ecfa 09-Jan-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add register tracking infrastructure

This patch adds new infrastructure to skip redundant selector store
operations on the same register to achieve a performance boost from
the packet path.

This is particularly noticeable in pure linear rulesets but it also
helps in rulesets which are already heaving relying in maps to avoid
ruleset linear inspection.

The idea is to keep data of the most recurrent store operations on
register to reuse them with cmp and lookup expressions.

This infrastructure allows for dynamic ruleset updates since the ruleset
blob reduction happens from the kernel.

Userspace still needs to be updated to maximize register utilization to
cooperate to improve register data reuse / reduce number of store on
register operations.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 642c8eff 09-Jan-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add NFT_REG32_NUM

Add a definition including the maximum number of 32-bits registers that
are used a scratchpad memory area to store data.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2c865a8a 09-Jan-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add rule blob layout

This patch adds a blob layout per chain to represent the ruleset in the
packet datapath.

size (unsigned long)
struct nft_rule_dp
struct nft_expr
...
struct nft_rule_dp
struct nft_expr
...
struct nft_rule_dp (is_last=1)

The new structure nft_rule_dp represents the rule in a more compact way
(smaller memory footprint) compared to the control-plane nft_rule
structure.

The ruleset blob is a read-only data structure. The first field contains
the blob size, then the rules containing expressions. There is a trailing
rule which is used by the tracing infrastructure which is equivalent to
the NULL rule marker in the previous representation. The blob size field
does not include the size of this trailing rule marker.

The ruleset blob is generated from the commit path.

This patch reuses the infrastructure available since 0cbc06b3faba
("netfilter: nf_tables: remove synchronize_rcu in commit phase") to
build the array of rules per chain.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c46b38dc 28-Oct-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_payload: support for inner header matching / mangling

Allow to match and mangle on inner headers / payload data after the
transport header. There is a new field in the pktinfo structure that
stores the inner header offset which is calculated only when requested.
Only TCP and UDP supported at this stage.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b5bdc6f9 28-Oct-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: convert pktinfo->tprot_set to flags field

Generalize boolean field to store more flags on the pktinfo structure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6fb721cf 26-Sep-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: honor NLM_F_CREATE and NLM_F_EXCL in event notification

Include the NLM_F_CREATE and NLM_F_EXCL flags in netlink event
notifications, otherwise userspace cannot distiguish between create and
add commands.

Fixes: 96518518cc41 ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 897389de 27-May-2021 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: remove xt_action_param from nft_pktinfo

Init it on demand in the nft_compat expression. This reduces size
of nft_pktinfo from 48 to 24 bytes on x86_64.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f06ad944 27-May-2021 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: remove unused arg in nft_set_pktinfo_unspec()

The functions pass extra skb arg, but either its not used or the helpers
can already access it via pkt->skb.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2d7b4ace 27-May-2021 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: add and use nft_thoff helper

This allows to change storage placement later on without changing readers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 85554eb9 27-May-2021 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: add and use nft_sk helper

This allows to change storage placement later on without changing readers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 179d9ba5 24-May-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: fix table flag updates

The dormant flag need to be updated from the preparation phase,
otherwise, two consecutive requests to dorm a table in the same batch
might try to remove the same hooks twice, resulting in the following
warning:

hook not found, pf 3 num 0
WARNING: CPU: 0 PID: 334 at net/netfilter/core.c:480 __nf_unregister_net_hook+0x1eb/0x610 net/netfilter/core.c:480
Modules linked in:
CPU: 0 PID: 334 Comm: kworker/u4:5 Not tainted 5.12.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: netns cleanup_net
RIP: 0010:__nf_unregister_net_hook+0x1eb/0x610 net/netfilter/core.c:480

This patch is a partial revert of 0ce7cf4127f1 ("netfilter: nftables:
update table flags from the commit phase") to restore the previous
behaviour.

However, there is still another problem: A batch containing a series of
dorm-wakeup-dorm table and vice-versa also trigger the warning above
since hook unregistration happens from the preparation phase, while hook
registration occurs from the commit phase.

To fix this problem, this patch adds two internal flags to annotate the
original dormant flag status which are __NFT_TABLE_F_WAS_DORMANT and
__NFT_TABLE_F_WAS_AWAKEN, to restore it from the abort path.

The __NFT_TABLE_F_UPDATE bitmask allows to handle the dormant flag update
with one single transaction.

Reported-by: syzbot+7ad5cd1615f2d89c6e7e@syzkaller.appspotmail.com
Fixes: 0ce7cf4127f1 ("netfilter: nftables: update table flags from the commit phase")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# aaa31047 27-Apr-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: add catch-all set element support

This patch extends the set infrastructure to add a special catch-all set
element. If the lookup fails to find an element (or range) in the set,
then the catch-all element is selected. Users can specify a mapping,
expression(s) and timeout to be attached to the catch-all element.

This patch adds a catchall list to the set, this list might contain more
than one single catch-all element (e.g. in case that the catch-all
element is removed and a new one is added in the same transaction).
However, most of the time, there will be either one element or no
elements at all in this list.

The catch-all element is identified via NFT_SET_ELEM_CATCHALL flag and
such special element has no NFTA_SET_ELEM_KEY attribute. There is a new
nft_set_elem_catchall object that stores a reference to the dummy
catch-all element (catchall->elem) whose layout is the same of the set
element type to reuse the existing set element codebase.

The set size does not apply to the catch-all element, users can define a
catch-all element even if the set is full.

The check for valid set element flags hava been updates to report
EOPNOTSUPP in case userspace requests flags that are not supported when
using new userspace nftables and old kernel.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d59d2f82 22-Apr-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: add nft_pernet() helper function

Consolidate call to net_generic(net, nf_tables_net_id) in this
wrapper function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b72920f6 15-Apr-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: counter hardware offload support

This patch adds the .offload_stats operation to synchronize hardware
stats with the expression data. Update the counter expression to use
this new interface. The hardware stats are retrieved from the netlink
dump path via FLOW_CLS_STATS command to the driver.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0854db2a 01-Apr-2021 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: use net_generic infra for transaction data

This moves all nf_tables pernet data from struct net to a net_generic
extension, with the exception of the gencursor.

The latter is used in the data path and also outside of the nf_tables
core. All others are only used from the configuration plane.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cefa31a9 25-Mar-2021 Florian Westphal <fw@strlen.de>

netfilter: nft_log: perform module load from nf_tables

modprobe calls from the nf_logger_find_get() API causes deadlock in very
special cases because they occur with the nf_tables transaction mutex held.

In the specific case of nf_log, deadlock is via:

A nf_tables -> transaction mutex -> nft_log -> modprobe -> nf_log_syslog \
-> pernet_ops rwsem -> wait for C
B netlink event -> rtnl_mutex -> nf_tables transaction mutex -> wait for A
C close() -> ip6mr_sk_done -> rtnl_mutex -> wait for B

Earlier patch added NFLOG/xt_LOG module softdeps to avoid the need to load
the backend module during a transaction.

For nft_log we would have to add a softdep for both nfnetlink_log or
nf_log_syslog, since we do not know in advance which of the two backends
are going to be configured.

This defers the modprobe op until after the transaction mutex is released.

Tested-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0ce7cf41 17-Mar-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: update table flags from the commit phase

Do not update table flags from the preparation phase. Store the flags
update into the transaction, then update the flags from the commit
phase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7b35582c 16-Mar-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: allow to update flowtable flags

Honor flowtable flags from the control update path. Disallow disabling
to toggle hardware offload support though.

Fixes: 8bb69f3b2918 ("netfilter: nf_tables: add flowtable offload control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6001a930 14-Feb-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: introduce table ownership

A userspace daemon like firewalld might need to monitor for netlink
updates to detect its ruleset removal by the (global) flush ruleset
command to ensure ruleset persistency. This adds extra complexity from
userspace and, for some little time, the firewall policy is not in
place.

This patch adds the NFT_TABLE_F_OWNER flag which allows a userspace
program to own the table that creates in exclusivity.

Tables that are owned...

- can only be updated and removed by the owner, non-owners hit EPERM if
they try to update it or remove it.
- are destroyed when the owner closes the netlink socket or the process
is gone (implicit netlink socket closure).
- are skipped by the global flush ruleset command.
- are listed in the global ruleset.

The userspace process that sets on the NFT_TABLE_F_OWNER flag need to
leave open the netlink socket.

A new NFTA_TABLE_OWNER netlink attribute specifies the netlink port ID
to identify the owner from userspace.

This patch also updates error reporting when an unknown table flag is
specified to change it from EINVAL to EOPNOTSUPP given that EINVAL is
usually reserved to report for malformed netlink messages to userspace.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 08a01c11 25-Jan-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: statify nft_parse_register()

This function is not used anymore by any extension, statify it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 345023b0 25-Jan-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: add nft_parse_register_store() and use it

This new function combines the netlink register attribute parser
and the store validation function.

This update requires to replace:

enum nft_registers dreg:8;

in many of the expression private areas otherwise compiler complains
with:

error: cannot take address of bit-field ‘dreg’

when passing the register field as reference.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4f16d25c 25-Jan-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: add nft_parse_register_load() and use it

This new function combines the netlink register attribute parser
and the load validation function.

This update requires to replace:

enum nft_registers sreg:8;

in many of the expression private areas otherwise compiler complains
with:

error: cannot take address of bit-field ‘sreg’

when passing the register field as reference.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# fca05d4d 15-Jan-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_dynset: honor stateful expressions in set definition

If the set definition contains stateful expressions, allocate them for
the newly added entries from the packet path.

Fixes: 65038428b2c6 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 563125a7 09-Dec-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: generalize set extension to support for several expressions

This patch replaces NFT_SET_EXPR by NFT_SET_EXT_EXPRESSIONS. This new
extension allows to attach several expressions to one set element (not
only one single expression as NFT_SET_EXPR provides). This patch
prepares for support for several expressions per set element in the
netlink userspace API.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 92b211a2 07-Dec-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: move nft_expr before nft_set

Move the nft_expr structure definition before nft_set. Expressions are
used by rules and sets, remove unnecessary forward declarations. This
comes as preparation to support for multiple expressions per set element.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8cfd9b0f 07-Dec-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: generalize set expressions support

Currently, the set infrastucture allows for one single expressions per
element. This patch extends the existing infrastructure to allow for up
to two expressions. This is not updating the netlink API yet, this is
coming as an initial preparation patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 917d80d3 08-Dec-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_dynset: fix timeouts later than 23 days

Use nf_msecs_to_jiffies64 and nf_jiffies64_to_msecs as provided by
8e1102d5a159 ("netfilter: nf_tables: support timeouts larger than 23
days"), otherwise ruleset listing breaks.

Fixes: a8b1e36d0d1d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 31cc578a 20-Oct-2020 Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>

netfilter: nftables_offload: KASAN slab-out-of-bounds Read in nft_flow_rule_create

This patch fixes the issue due to:

BUG: KASAN: slab-out-of-bounds in nft_flow_rule_create+0x622/0x6a2
net/netfilter/nf_tables_offload.c:40
Read of size 8 at addr ffff888103910b58 by task syz-executor227/16244

The error happens when expr->ops is accessed early on before performing the boundary check and after nft_expr_next() moves the expr to go out-of-bounds.

This patch checks the boundary condition before expr->ops that fixes the slab-out-of-bounds Read issue.

Add nft_expr_more() and use it to fix this problem.

Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d25e2e93 14-Oct-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: restore NF_INET_NUMHOOKS

This definition is used by the iptables legacy UAPI, restore it.

Fixes: d3519cb89f6d ("netfilter: nf_tables: add inet ingress support")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# d3519cb8 07-Oct-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add inet ingress support

This patch adds a new ingress hook for the inet family. The inet ingress
hook emulates the IP receive path code, therefore, unclean packets are
drop before walking over the ruleset in this basechain.

This patch also introduces the nft_base_chain_netdev() helper function
to check if this hook is bound to one or more devices (through the hook
list infrastructure). This check allows to perform the same handling for
the inet ingress as it would be a netdev ingress chain from the control
plane.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 002f2176 28-Sep-2020 Jose M. Guisado Gomez <guigom@riseup.net>

netfilter: nf_tables: add userdata attributes to nft_chain

Enables storing userdata for nft_chain. Field udata points to user data
and udlen stores its length.

Adds new attribute flag NFTA_CHAIN_USERDATA.

Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8a8b9047 16-Sep-2020 YueHaibing <yuehaibing@huawei.com>

netfilter: nf_tables: Remove ununsed function nft_data_debug

It is never used, so can be removed.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b131c964 08-Sep-2020 Jose M. Guisado Gomez <guigom@riseup.net>

netfilter: nf_tables: add userdata support for nft_object

Enables storing userdata for nft_object. Initially this will store an
optional comment but can be extended in the future as needed.

Adds new attribute NFTA_OBJ_USERDATA to nft_object.

Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7a81575b 20-Aug-2020 Jose M. Guisado Gomez <guigom@riseup.net>

netfilter: nf_tables: add userdata attributes to nft_table

Enables storing userdata for nft_table. Field udata points to user data
and udlen store its length.

Adds new attribute flag NFTA_TABLE_USERDATA

Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1e105e6a 20-Aug-2020 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: fix destination register zeroing

Following bug was reported via irc:
nft list ruleset
set knock_candidates_ipv4 {
type ipv4_addr . inet_service
size 65535
elements = { 127.0.0.1 . 123,
127.0.0.1 . 123 }
}
..
udp dport 123 add @knock_candidates_ipv4 { ip saddr . 123 }
udp dport 123 add @knock_candidates_ipv4 { ip saddr . udp dport }

It should not have been possible to add a duplicate set entry.

After some debugging it turned out that the problem is the immediate
value (123) in the second-to-last rule.

Concatenations use 32bit registers, i.e. the elements are 8 bytes each,
not 6 and it turns out the kernel inserted

inet firewall @knock_candidates_ipv4
element 0100007f ffff7b00 : 0 [end]
element 0100007f 00007b00 : 0 [end]

Note the non-zero upper bits of the first element. It turns out that
nft_immediate doesn't zero the destination register, but this is needed
when the length isn't a multiple of 4.

Furthermore, the zeroing in nft_payload is broken. We can't use
[len / 4] = 0 -- if len is a multiple of 4, index is off by one.

Skip zeroing in this case and use a conditional instead of (len -1) / 4.

Fixes: 49499c3e6e18 ("netfilter: nf_tables: switch registers to 32 bit addressing")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ffe8923f 24-Jul-2020 Florian Westphal <fw@strlen.de>

netfilter: nft_compat: make sure xtables destructors have run

Pablo Neira found that after recent update of xt_IDLETIMER the
iptables-nft tests sometimes show an error.

He tracked this down to the delayed cleanup used by nf_tables core:
del rule (transaction A)
add rule (transaction B)

Its possible that by time transaction B (both in same netns) runs,
the xt target destructor has not been invoked yet.

For native nft expressions this is no problem because all expressions
that have such side effects make sure these are handled from the commit
phase, rather than async cleanup.

For nft_compat however this isn't true.

Instead of forcing synchronous behaviour for nft_compat, keep track
of the number of outstanding destructor calls.

When we attempt to create a new expression, flush the cleanup worker
to make sure destructors have completed.

With lots of help from Pablo Neira.

Reported-by: Pablo Neira Ayso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d0e2c7de 30-Jun-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add NFT_CHAIN_BINDING

This new chain flag specifies that:

* the kernel dynamically allocates the chain name, if no chain name
is specified.

* If the immediate expression that refers to this chain is removed,
then this bound chain (and its content) is destroyed.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 67c49de4 30-Jun-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: expose enum nft_chain_flags through UAPI

This enum definition was never exposed through UAPI. Rename
NFT_BASE_CHAIN to NFT_CHAIN_BASE for consistency.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 74cccc3d 30-Jun-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add NFTA_CHAIN_ID attribute

This netlink attribute allows you to refer to chains inside a
transaction as an alternative to the name and the handle. The chain
binding support requires this new chain ID approach.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# abadb2f8 20-May-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: delete devices from flowtable

This patch allows users to delete devices from existing flowtables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 78d9f48f 20-May-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add devices to existing flowtable

This patch allows users to add devices to an existing flowtable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# fdb9c405 24-Apr-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: allow up to 64 bytes in the set element data area

So far, the set elements could store up to 128-bits in the data area.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a26c1e49 31-Mar-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: do not update stateful expressions if lookup is inverted

Initialize set lookup matching element to NULL. Otherwise, the
NFT_LOOKUP_F_INV flag reverses the matching logic and it leads to
deference an uninitialized pointer to the matching element. Make sure
element data area and stateful expression are accessed if there is a
matching set element.

This patch undoes 24791b9aa1ab ("netfilter: nft_set_bitmap: initialize set
element extension in lookups") which is not required anymore.

Fixes: 339706bc21c1 ("netfilter: nft_lookup: update element stateful expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d56aab26 27-Mar-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: skip set types that do not support for expressions

The bitmap set does not support for expressions, skip it from the
estimation step.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 65038428 17-Mar-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: allow to specify stateful expression in set definition

This patch allows users to specify the stateful expression for the
elements in this set via NFTA_SET_EXPR. This new feature allows you to
turn on counters for all of the elements in this set.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c604cc691 17-Mar-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: move nft_expr_clone() to nf_tables_api.c

Move the nft_expr_clone() helper function to the core.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 76adfafe 11-Mar-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add nft_set_elem_update_expr() helper function

This helper function runs the eval path of the stateful expression
of an existing set element.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 795a6d6b 11-Mar-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: statify nft_expr_init()

Not exposed anymore to modules, statify this function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a7fc9368 11-Mar-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add nft_set_elem_expr_alloc()

Add helper function to create stateful expression.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6daf1414 20-Feb-2020 Gustavo A. R. Silva <gustavo@embeddedor.com>

netfilter: Replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

Lastly, fix checkpatch.pl warning
WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
in net/bridge/netfilter/ebtables.c

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 24d19826 18-Feb-2020 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: make all set structs const

They do not need to be writeable anymore.

v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano)

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e32a4dc6 18-Feb-2020 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: make sets built-in

Placing nftables set support in an extra module is pointless:

1. nf_tables needs dynamic registeration interface for sake of one module
2. nft heavily relies on sets, e.g. even simple rule like
"nft ... tcp dport { 80, 443 }" will not work with _SETS=n.

IOW, either nftables isn't used or both nf_tables and nf_tables_set
modules are needed anyway.

With extra module:
307K net/netfilter/nf_tables.ko
79K net/netfilter/nf_tables_set.ko

text data bss dec filename
146416 3072 545 150033 nf_tables.ko
35496 1817 0 37313 nf_tables_set.ko

This patch:
373K net/netfilter/nf_tables.ko

178563 4049 545 183157 nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f3a2181e 21-Jan-2020 Stefano Brivio <sbrivio@redhat.com>

netfilter: nf_tables: Support for sets with multiple ranged fields

Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
to specify the length of each field in a set concatenation.

This allows set implementations to support concatenation of multiple
ranged items, as they can divide the input key into matching data for
every single field. Such set implementations would be selected as
they specify support for NFT_SET_INTERVAL and allow desc->field_count
to be greater than one. Explicitly disallow this for nft_set_rbtree.

In order to specify the interval for a set entry, userspace would
include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass
range endpoints as two separate keys, represented by attributes
NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.

While at it, export the number of 32-bit registers available for
packet matching, as nftables will need this to know the maximum
number of field lengths that can be specified.

For example, "packets with an IPv4 address between 192.0.2.0 and
192.0.2.42, with destination port between 22 and 25", can be
expressed as two concatenated elements:

NFTA_SET_ELEM_KEY: 192.0.2.0 . 22
NFTA_SET_ELEM_KEY_END: 192.0.2.42 . 25

and NFTA_SET_DESC_CONCAT attribute would contain:

NFTA_LIST_ELEM
NFTA_SET_FIELD_LEN: 4
NFTA_LIST_ELEM
NFTA_SET_FIELD_LEN: 2

v4: No changes
v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
v2: No changes

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7b225d0b 21-Jan-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute

Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the
interval between kernel and userspace.

This patch also adds the NFT_SET_EXT_KEY_END extension to store the
closing element value in this interval.

v4: No changes
v3: New patch

[sbrivio: refactor error paths and labels; add corresponding
nft_set_ext_type for new key; rebase]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7cd9a58d 19-Nov-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: constify nft_reg_load{8, 16, 64}()

This patch constifies the pointer to source register data that is passed
as an input parameter.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 250367c5 31-Oct-2019 Lukas Wunner <lukas@wunner.de>

netfilter: nf_tables: Align nft_expr private data to 64-bit

Invoking the following commands on a 32-bit architecture with strict
alignment requirements (such as an ARMv7-based Raspberry Pi) results
in an alignment exception:

# nft add table ip test-ip4
# nft add chain ip test-ip4 output { type filter hook output priority 0; }
# nft add rule ip test-ip4 output quota 1025 bytes

Alignment trap: not handling instruction e1b26f9f at [<7f4473f8>]
Unhandled fault: alignment exception (0x001) at 0xb832e824
Internal error: : 1 [#1] PREEMPT SMP ARM
Hardware name: BCM2835
[<7f4473fc>] (nft_quota_do_init [nft_quota])
[<7f447448>] (nft_quota_init [nft_quota])
[<7f4260d0>] (nf_tables_newrule [nf_tables])
[<7f4168dc>] (nfnetlink_rcv_batch [nfnetlink])
[<7f416bd0>] (nfnetlink_rcv [nfnetlink])
[<8078b334>] (netlink_unicast)
[<8078b664>] (netlink_sendmsg)
[<8071b47c>] (sock_sendmsg)
[<8071bd18>] (___sys_sendmsg)
[<8071ce3c>] (__sys_sendmsg)
[<8071ce94>] (sys_sendmsg)

The reason is that nft_quota_do_init() calls atomic64_set() on an
atomic64_t which is only aligned to 32-bit, not 64-bit, because it
succeeds struct nft_expr in memory which only contains a 32-bit pointer.
Fix by aligning the nft_expr private data to 64-bit.

Fixes: 96518518cc41 ("netfilter: add nftables")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d54725cd 16-Oct-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: support for multiple devices per netdev hook

This patch allows you to register one netdev basechain to multiple
devices. This adds a new NFTA_HOOK_DEVS netlink attribute to specify
the list of netdevices. Basechains store a list of hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cb662ac6 16-Oct-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: increase maximum devices number per flowtable

Rise the maximum limit of devices per flowtable up to 256. Rename
NFT_FLOWTABLE_DEVICE_MAX to NFT_NETDEVICE_MAX in preparation to reuse
the netdev hook parser for ingress basechain.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3f0465a9 16-Oct-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables

Use a list of hooks per device instead an array.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 71a8a63b 16-Oct-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_flow_table: move priority to struct nf_flowtable

Hardware offload needs access to the priority field, store this field in
the nf_flowtable object.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9b05b6e1 24-Sep-2019 Laura Garcia Liebana <nevola@gmail.com>

netfilter: nf_tables: bogus EBUSY when deleting flowtable after flush

The deletion of a flowtable after a flush in the same transaction
results in EBUSY. This patch adds an activation and deactivation of
flowtables in order to update the _use_ counter.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ad652f38 16-Sep-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add NFT_CHAIN_POLICY_UNSET and use it

Default policy is defined as a unsigned 8-bit field, do not use a
negative value to leave it unset, use this new NFT_CHAIN_POLICY_UNSET
instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f19438bd 13-Sep-2019 Jeremy Sowden <jeremy@azazel.net>

netfilter: remove CONFIG_NETFILTER checks from headers.

`struct nf_hook_ops`, `struct nf_hook_state` and the `nf_hookfn`
function typedef appear in function and struct declarations and
definitions in a number of netfilter headers. The structs and typedef
themselves are defined by linux/netfilter.h but only when
CONFIG_NETFILTER is enabled. Define them unconditionally and add
forward declarations in order to remove CONFIG_NETFILTER conditionals
from the other headers.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d62d0ba9 26-Aug-2019 Fernando Fernandez Mancera <ffmancera@riseup.net>

netfilter: nf_tables: Introduce stateful object update operation

This patch adds the infrastructure needed for the stateful object update
support.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d0a8d877 17-Aug-2019 Ander Juaristi <a@juaristi.eus>

netfilter: nft_dynset: support for element deletion

This patch implements the delete operation from the ruleset.

It implements a new delete() function in nft_set_rhash. It is simpler
to use than the already existing remove(), because it only takes the set
and the key as arguments, whereas remove() expects a full
nft_set_elem structure.

Signed-off-by: Ander Juaristi <a@juaristi.eus>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a1b840ad 17-Aug-2019 Ander Juaristi <a@juaristi.eus>

netfilter: nf_tables: Introduce new 64-bit helper register functions

Introduce new helper functions to load/store 64-bit values onto/from
registers:

- nft_reg_store64
- nft_reg_load64

This commit also re-orders all these helpers from smallest to largest
target bit size.

Signed-off-by: Ander Juaristi <a@juaristi.eus>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 78458e3e 07-Aug-2019 Jeremy Sowden <jeremy@azazel.net>

netfilter: add missing IS_ENABLED(CONFIG_NETFILTER) checks to some header-files.

linux/netfilter.h defines a number of struct and inline function
definitions which are only available is CONFIG_NETFILTER is enabled.
These structs and functions are used in declarations and definitions in
other header-files. Added preprocessor checks to make sure these
headers will compile if CONFIG_NETFILTER is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 47e640af 07-Aug-2019 Jeremy Sowden <jeremy@azazel.net>

netfilter: add missing IS_ENABLED(CONFIG_NF_TABLES) check to header-file.

nf_tables.h defines an API comprising several inline functions and
macros that depend on the nft member of struct net. However, this is
only defined is CONFIG_NF_TABLES is enabled. Added preprocessor checks
to ensure that nf_tables.h will compile if CONFIG_NF_TABLES is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6a0a8d10 09-Aug-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use-after-free in failing rule with bound set

If a rule that has already a bound anonymous set fails to be added, the
preparation phase releases the rule and the bound set. However, the
transaction object from the abort path still has a reference to the set
object that is stale, leading to a use-after-free when checking for the
set->bound field. Add a new field to the transaction that specifies if
the set is bound, so the abort path can skip releasing it since the rule
command owns it and it takes care of releasing it. After this update,
the set->bound field is removed.

[ 24.649883] Unable to handle kernel paging request at virtual address 0000000000040434
[ 24.657858] Mem abort info:
[ 24.660686] ESR = 0x96000004
[ 24.663769] Exception class = DABT (current EL), IL = 32 bits
[ 24.669725] SET = 0, FnV = 0
[ 24.672804] EA = 0, S1PTW = 0
[ 24.675975] Data abort info:
[ 24.678880] ISV = 0, ISS = 0x00000004
[ 24.682743] CM = 0, WnR = 0
[ 24.685723] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000428952000
[ 24.692207] [0000000000040434] pgd=0000000000000000
[ 24.697119] Internal error: Oops: 96000004 [#1] SMP
[...]
[ 24.889414] Call trace:
[ 24.891870] __nf_tables_abort+0x3f0/0x7a0
[ 24.895984] nf_tables_abort+0x20/0x40
[ 24.899750] nfnetlink_rcv_batch+0x17c/0x588
[ 24.904037] nfnetlink_rcv+0x13c/0x190
[ 24.907803] netlink_unicast+0x18c/0x208
[ 24.911742] netlink_sendmsg+0x1b0/0x350
[ 24.915682] sock_sendmsg+0x4c/0x68
[ 24.919185] ___sys_sendmsg+0x288/0x2c8
[ 24.923037] __sys_sendmsg+0x7c/0xd0
[ 24.926628] __arm64_sys_sendmsg+0x2c/0x38
[ 24.930744] el0_svc_common.constprop.0+0x94/0x158
[ 24.935556] el0_svc_handler+0x34/0x90
[ 24.939322] el0_svc+0x8/0xc
[ 24.942216] Code: 37280300 f9404023 91014262 aa1703e0 (f9401863)
[ 24.948336] ---[ end trace cebbb9dcbed3b56f ]---

Fixes: f6ac85858976 ("netfilter: nf_tables: unbind set in rule from commit path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 14bfb13f 19-Jul-2019 Pablo Neira Ayuso <pablo@netfilter.org>

net: flow_offload: add flow_block structure and use it

This object stores the flow block callbacks that are attached to this
block. Update flow_block_cb_lookup() to take this new object.

This patch restores the block sharing feature.

Fixes: da3eeb904ff4 ("net: flow_offload: add list handling functions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c9626a2c 09-Jul-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add hardware offload support

This patch adds hardware offload support for nftables through the
existing netdev_ops->ndo_setup_tc() interface, the TC_SETUP_CLSFLOWER
classifier and the flow rule API. This hardware offload support is
available for the NFPROTO_NETDEV family and the ingress hook.

Each nftables expression has a new ->offload interface, that is used to
populate the flow rule object that is attached to the transaction
object.

There is a new per-table NFT_TABLE_F_HW flag, that is set on to offload
an entire table, including all of its chains.

This patch supports for basic metadata (layer 3 and 4 protocol numbers),
5-tuple payload matching and the accept/drop actions; this also includes
basechain hardware offload only.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 79ebb5bb 18-Jun-2019 Laura Garcia Liebana <nevola@gmail.com>

netfilter: nf_tables: enable set expiration time for set elements

Currently, the expiration of every element in a set or map
is a read-only parameter generated at kernel side.

This change will permit to set a certain expiration date
per element that will be required, for example, during
stateful replication among several nodes.

This patch handles the NFTA_SET_ELEM_EXPIRATION in order
to configure the expiration parameter per element, or
will use the timeout in the case that the expiration
is not set.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a4cb98f3 15-Apr-2019 Paul Gortmaker <paul.gortmaker@windriver.com>

netfilter: nf_tables: drop include of module.h from nf_tables.h

Ideally, header files under include/linux shouldn't be adding
includes of other headers, in anticipation of their consumers,
but just the headers needed for the header itself to pass
parsing with CPP.

The module.h is particularly bad in this sense, as it itself does
include a whole bunch of other headers, due to the complexity of
module support.

Since nf_tables.h is not going into a module struct looking for
specific fields, we can just let it know that module is a struct,
just like about 60 other include/linux headers already do.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f1f86d09 15-Apr-2019 Paul Gortmaker <paul.gortmaker@windriver.com>

netfilter: nf_tables: relocate header content to consumer

The nf_tables.h header is used in a lot of files, but it turns out
that there is only one actual user of nft_expr_clone().

Hence we relocate that function to be with the one consumer of it
and avoid having to process it with CPP for all the other files.

This will also enable a reduction in the other headers that the
nf_tables.h itself has to include just to be stand-alone, hence
a pending further significant reduction in the CPP content that
needs to get processed for each netfilter file.

Note that the explicit "inline" has been dropped as part of this
relocation. In similar changes to this, I believe Dave has asked
this be done, so we free up gcc to make the choice of whether to
inline or not.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3b0a081d 04-Apr-2019 Florian Westphal <fw@strlen.de>

netfilter: make two functions static

They have no external callers anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c1deb065 27-Mar-2019 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: merge route type into core

very little code, so it really doesn't make sense to have extra
modules or even a kconfig knob for this.

Merge them and make functionality available unconditionally.
The merge makes inet family route support trivial, so add it
as well here.

Before:
text data bss dec hex filename
835 832 0 1667 683 nft_chain_route_ipv4.ko
870 832 0 1702 6a6 nft_chain_route_ipv6.ko
111568 2556 529 114653 1bfdd nf_tables.ko

After:
text data bss dec hex filename
113133 2556 529 116218 1c5fa nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 273fe3f1 08-Mar-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: bogus EBUSY when deleting set after flush

Set deletion after flush coming in the same batch results in EBUSY. Add
set use counter to track the number of references to this set from
rules. We cannot rely on the list of bindings for this since such list
is still populated from the preparation phase.

Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 40ba1d9b 07-Mar-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: fix set double-free in abort path

The abort path can cause a double-free of an anonymous set.
Added-and-to-be-aborted rule looks like this:

udp dport { 137, 138 } drop

The to-be-aborted transaction list looks like this:

newset
newsetelem
newsetelem
rule

This gets walked in reverse order, so first pass disables the rule, the
set elements, then the set.

After synchronize_rcu(), we then destroy those in same order: rule, set
element, set element, newset.

Problem is that the anonymous set has already been bound to the rule, so
the rule (lookup expression destructor) already frees the set, when then
cause use-after-free when trying to delete the elements from this set,
then try to free the set again when handling the newset expression.

Rule releases the bound set in first place from the abort path, this
causes the use-after-free on set element removal when undoing the new
element transactions. To handle this, skip new element transaction if
set is bound from the abort path.

This is still causes the use-after-free on set element removal. To
handle this, remove transaction from the list when the set is already
bound.

Joint work with Florian Westphal.

Fixes: f6ac85858976 ("netfilter: nf_tables: unbind set in rule from commit path")
Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1325
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b8e20400 13-Feb-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_compat: use .release_ops and remove list of extension

Add .release_ops, that is called in case of error at a later stage in
the expression initialization path, ie. .select_ops() has been already
set up operations and that needs to be undone. This allows us to unwind
.select_ops from the error path, ie. release the dynamic operations for
this extension.

Moreover, allocate one single operation instead of recycling them, this
comes at the cost of consuming a bit more memory per rule, but it
simplifies the infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f6ac8585 02-Feb-2019 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: unbind set in rule from commit path

Anonymous sets that are bound to rules from the same transaction trigger
a kernel splat from the abort path due to double set list removal and
double free.

This patch updates the logic to search for the transaction that is
responsible for creating the set and disable the set list removal and
release, given the rule is now responsible for this. Lookup is reverse
since the transaction that adds the set is likely to be at the tail of
the list.

Moreover, this patch adds the unbind step to deliver the event from the
commit path. This should not be done from the worker thread, since we
have no guarantees of in-order delivery to the listener.

This patch removes the assumption that both activate and deactivate
callbacks need to be provided.

Fixes: cd5125d8f518 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase")
Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4d44175a 08-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: handle nft_object lookups via rhltable

Instead of linear search, use rhlist interface to look up the objects.
This fixes rulesets with thousands of named objects (quota, counters and
the like).

We only use a single table for this and consider the address of the
table we're doing the lookup in as a part of the key.

This reduces restore time of a sample ruleset with ~20k named counters
from 37 seconds to 0.8 seconds.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d152159b 08-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: prepare nft_object for lookups via hashtable

Add a 'key' structure for object, so we can look them up by name + table
combination (the name can be the same in each table).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0935d558 29-Aug-2018 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: asynchronous release

Release the committed transaction log from a work queue, moving
expensive synchronize_rcu out of the locked section and providing
opportunity to batch this.

On my test machine this cuts runtime of nft-test.py in half.
Based on earlier patch from Pablo Neira Ayuso.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cd5125d8 29-Aug-2018 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: split set destruction in deactivate and destroy phase

Splits unbind_set into destroy_set and unbinding operation.

Unbinding removes set from lists (so new transaction would not
find it anymore) but keeps memory allocated (so packet path continues
to work).

Rebind function is added to allow unrolling in case transaction
that wants to remove set is aborted.

Destroy function is added to free the memory, but this could occur
outside of transaction in the future.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d209df3e 02-Aug-2018 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: fix register ordering

We must register nfnetlink ops last, as that exposes nf_tables to
userspace. Without this, we could theoretically get nfnetlink request
before net->nft state has been initialized.

Fixes: 99633ab29b213 ("netfilter: nf_tables: complete net namespace support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4ef360dd 25-Jul-2018 Taehee Yoo <ap420073@gmail.com>

netfilter: nft_set: fix allocation size overflow in privsize callback.

In order to determine allocation size of set, ->privsize is invoked.
At this point, both desc->size and size of each data structure of set
are used. desc->size means number of element that is given by user.
desc->size is u32 type. so that upperlimit of set element is 4294967295.
but return type of ->privsize is also u32. hence overflow can occurred.

test commands:
%nft add table ip filter
%nft add set ip filter hash1 { type ipv4_addr \; size 4294967295 \; }
%nft list ruleset

splat looks like:
[ 1239.202910] kasan: CONFIG_KASAN_INLINE enabled
[ 1239.208788] kasan: GPF could be caused by NULL-ptr deref or user memory access
[ 1239.217625] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 1239.219329] CPU: 0 PID: 1603 Comm: nft Not tainted 4.18.0-rc5+ #7
[ 1239.229091] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
[ 1239.229091] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 <42> 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
[ 1239.229091] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
[ 1239.229091] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
[ 1239.229091] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
[ 1239.229091] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
[ 1239.229091] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
[ 1239.229091] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
[ 1239.229091] FS: 00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
[ 1239.229091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1239.229091] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
[ 1239.229091] Call Trace:
[ 1239.229091] ? nft_hash_remove+0xf0/0xf0 [nf_tables_set]
[ 1239.229091] ? memset+0x1f/0x40
[ 1239.229091] ? __nla_reserve+0x9f/0xb0
[ 1239.229091] ? memcpy+0x34/0x50
[ 1239.229091] nf_tables_dump_set+0x9a1/0xda0 [nf_tables]
[ 1239.229091] ? __kmalloc_reserve.isra.29+0x2e/0xa0
[ 1239.229091] ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
[ 1239.229091] ? nf_tables_commit+0x2c60/0x2c60 [nf_tables]
[ 1239.229091] netlink_dump+0x470/0xa20
[ 1239.229091] __netlink_dump_start+0x5ae/0x690
[ 1239.229091] nft_netlink_dump_start_rcu+0xd1/0x160 [nf_tables]
[ 1239.229091] nf_tables_getsetelem+0x2e5/0x4b0 [nf_tables]
[ 1239.229091] ? nft_get_set_elem+0x440/0x440 [nf_tables]
[ 1239.229091] ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
[ 1239.229091] ? nf_tables_dump_obj_done+0x70/0x70 [nf_tables]
[ 1239.229091] ? nla_parse+0xab/0x230
[ 1239.229091] ? nft_get_set_elem+0x440/0x440 [nf_tables]
[ 1239.229091] nfnetlink_rcv_msg+0x7f0/0xab0 [nfnetlink]
[ 1239.229091] ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
[ 1239.229091] ? debug_show_all_locks+0x290/0x290
[ 1239.229091] ? sched_clock_cpu+0x132/0x170
[ 1239.229091] ? find_held_lock+0x39/0x1b0
[ 1239.229091] ? sched_clock_local+0x10d/0x130
[ 1239.229091] netlink_rcv_skb+0x211/0x320
[ 1239.229091] ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
[ 1239.229091] ? netlink_ack+0x7b0/0x7b0
[ 1239.229091] ? ns_capable_common+0x6e/0x110
[ 1239.229091] nfnetlink_rcv+0x2d1/0x310 [nfnetlink]
[ 1239.229091] ? nfnetlink_rcv_batch+0x10f0/0x10f0 [nfnetlink]
[ 1239.229091] ? netlink_deliver_tap+0x829/0x930
[ 1239.229091] ? lock_acquire+0x265/0x2e0
[ 1239.229091] netlink_unicast+0x406/0x520
[ 1239.509725] ? netlink_attachskb+0x5b0/0x5b0
[ 1239.509725] ? find_held_lock+0x39/0x1b0
[ 1239.509725] netlink_sendmsg+0x987/0xa20
[ 1239.509725] ? netlink_unicast+0x520/0x520
[ 1239.509725] ? _copy_from_user+0xa9/0xc0
[ 1239.509725] __sys_sendto+0x21a/0x2c0
[ 1239.509725] ? __ia32_sys_getpeername+0xa0/0xa0
[ 1239.509725] ? retint_kernel+0x10/0x10
[ 1239.509725] ? sched_clock_cpu+0x132/0x170
[ 1239.509725] ? find_held_lock+0x39/0x1b0
[ 1239.509725] ? lock_downgrade+0x540/0x540
[ 1239.509725] ? up_read+0x1c/0x100
[ 1239.509725] ? __do_page_fault+0x763/0x970
[ 1239.509725] ? retint_user+0x18/0x18
[ 1239.509725] __x64_sys_sendto+0x177/0x180
[ 1239.509725] do_syscall_64+0xaa/0x360
[ 1239.509725] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1239.509725] RIP: 0033:0x7f5a8f468e03
[ 1239.509725] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb d0 0f 1f 84 00 00 00 00 00 83 3d 49 c9 2b 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8
[ 1239.509725] RSP: 002b:00007ffd78d0b778 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 1239.509725] RAX: ffffffffffffffda RBX: 00007ffd78d0c890 RCX: 00007f5a8f468e03
[ 1239.509725] RDX: 0000000000000034 RSI: 00007ffd78d0b7e0 RDI: 0000000000000003
[ 1239.509725] RBP: 00007ffd78d0b7d0 R08: 00007f5a8f15c160 R09: 000000000000000c
[ 1239.509725] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd78d0b7e0
[ 1239.509725] R13: 0000000000000034 R14: 00007f5a8f9aff60 R15: 00005648040094b0
[ 1239.509725] Modules linked in: nf_tables_set nf_tables nfnetlink ip_tables x_tables
[ 1239.670713] ---[ end trace 39375adcda140f11 ]---
[ 1239.676016] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
[ 1239.682834] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 <42> 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
[ 1239.705108] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
[ 1239.711115] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
[ 1239.719269] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
[ 1239.727401] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
[ 1239.735530] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
[ 1239.743658] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
[ 1239.751785] FS: 00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
[ 1239.760993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1239.767560] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
[ 1239.775679] Kernel panic - not syncing: Fatal exception
[ 1239.776630] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1239.776630] Rebooting in 5 seconds..

Fixes: 20a69341f2d0 ("netfilter: nf_tables: add netlink set API")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b8088dda 16-Jul-2018 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: use dev->name directly

no need to store the name in separate area.

Furthermore, it uses kmalloc but not kfree and most accesses seem to treat
it as char[IFNAMSIZ] not char *.

Remove this and use dev->name instead.

In case event zeroed dev, just omit the name in the dump.

Fixes: d92191aa84e5f1 ("netfilter: nf_tables: cache device name in flowtable object")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 26b2f552 12-Jul-2018 Taehee Yoo <ap420073@gmail.com>

netfilter: nf_tables: fix jumpstack depth validation

The level of struct nft_ctx is updated by nf_tables_check_loops(). That
is used to validate jumpstack depth. But jumpstack validation routine
doesn't update and validate recursively. So, in some cases, chain depth
can be bigger than the NFT_JUMP_STACK_SIZE.

After this patch, The jumpstack validation routine is located in the
nft_chain_validate(). When new rules or new set elements are added, the
nft_table_validate() is called by the nf_tables_newrule and the
nf_tables_newsetelem. The nft_table_validate() calls the
nft_chain_validate() that visit all their children chains recursively.
So it can update depth of chain certainly.

Reproducer:
%cat ./test.sh
#!/bin/bash
nft add table ip filter
nft add chain ip filter input { type filter hook input priority 0\; }
for ((i=0;i<20;i++)); do
nft add chain ip filter a$i
done

nft add rule ip filter input jump a1

for ((i=0;i<10;i++)); do
nft add rule ip filter a$i jump a$((i+1))
done

for ((i=11;i<19;i++)); do
nft add rule ip filter a$i jump a$((i+1))
done

nft add rule ip filter a10 jump a11

Result:
[ 253.931782] WARNING: CPU: 1 PID: 0 at net/netfilter/nf_tables_core.c:186 nft_do_chain+0xacc/0xdf0 [nf_tables]
[ 253.931915] Modules linked in: nf_tables nfnetlink ip_tables x_tables
[ 253.932153] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.18.0-rc3+ #48
[ 253.932153] RIP: 0010:nft_do_chain+0xacc/0xdf0 [nf_tables]
[ 253.932153] Code: 83 f8 fb 0f 84 c7 00 00 00 e9 d0 00 00 00 83 f8 fd 74 0e 83 f8 ff 0f 84 b4 00 00 00 e9 bd 00 00 00 83 bd 64 fd ff ff 0f 76 09 <0f> 0b 31 c0 e9 bc 02 00 00 44 8b ad 64 fd
[ 253.933807] RSP: 0018:ffff88011b807570 EFLAGS: 00010212
[ 253.933807] RAX: 00000000fffffffd RBX: ffff88011b807660 RCX: 0000000000000000
[ 253.933807] RDX: 0000000000000010 RSI: ffff880112b39d78 RDI: ffff88011b807670
[ 253.933807] RBP: ffff88011b807850 R08: ffffed0023700ece R09: ffffed0023700ecd
[ 253.933807] R10: ffff88011b80766f R11: ffffed0023700ece R12: ffff88011b807898
[ 253.933807] R13: ffff880112b39d80 R14: ffff880112b39d60 R15: dffffc0000000000
[ 253.933807] FS: 0000000000000000(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
[ 253.933807] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 253.933807] CR2: 00000000014f1008 CR3: 000000006b216000 CR4: 00000000001006e0
[ 253.933807] Call Trace:
[ 253.933807] <IRQ>
[ 253.933807] ? sched_clock_cpu+0x132/0x170
[ 253.933807] ? __nft_trace_packet+0x180/0x180 [nf_tables]
[ 253.933807] ? sched_clock_cpu+0x132/0x170
[ 253.933807] ? debug_show_all_locks+0x290/0x290
[ 253.933807] ? __lock_acquire+0x4835/0x4af0
[ 253.933807] ? inet_ehash_locks_alloc+0x1a0/0x1a0
[ 253.933807] ? unwind_next_frame+0x159e/0x1840
[ 253.933807] ? __read_once_size_nocheck.constprop.4+0x5/0x10
[ 253.933807] ? nft_do_chain_ipv4+0x197/0x1e0 [nf_tables]
[ 253.933807] ? nft_do_chain+0x5/0xdf0 [nf_tables]
[ 253.933807] nft_do_chain_ipv4+0x197/0x1e0 [nf_tables]
[ 253.933807] ? nft_do_chain_arp+0xb0/0xb0 [nf_tables]
[ 253.933807] ? __lock_is_held+0x9d/0x130
[ 253.933807] nf_hook_slow+0xc4/0x150
[ 253.933807] ip_local_deliver+0x28b/0x380
[ 253.933807] ? ip_call_ra_chain+0x3e0/0x3e0
[ 253.933807] ? ip_rcv_finish+0x1610/0x1610
[ 253.933807] ip_rcv+0xbcc/0xcc0
[ 253.933807] ? debug_show_all_locks+0x290/0x290
[ 253.933807] ? ip_local_deliver+0x380/0x380
[ 253.933807] ? __lock_is_held+0x9d/0x130
[ 253.933807] ? ip_local_deliver+0x380/0x380
[ 253.933807] __netif_receive_skb_core+0x1c9c/0x2240

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1b2470e5 02-Jun-2018 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: handle chain name lookups via rhltable

If there is a significant amount of chains list search is too slow, so
add an rhlist table for this.

This speeds up ruleset loading: for every new rule we have to check if
the name already exists in current generation.

We need to be able to cope with duplicate chain names in case a transaction
drops the nfnl mutex (for request_module) and the abort of this old
transaction is still pending.

The list is kept -- we need a way to iterate chains even if hash resize is
in progress without missing an entry.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 371ebcbb 02-Jun-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add destroy_clone expression

Before this patch, cloned expressions are released via ->destroy. This
is a problem for the new connlimit expression since the ->destroy path
drop a reference on the conntrack modules and it unregisters hooks. The
new ->destroy_clone provides context that this expression is being
released from the packet path, so it is mirroring ->clone(), where
neither module reference is dropped nor hooks need to be unregistered -
because this done from the control plane path from the ->init() path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 79b174ad 02-Jun-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: garbage collection for stateful expressions

Use garbage collector to schedule removal of elements based of feedback
from expression that this element comes with. Therefore, the garbage
collector is not guided by timeout expirations in this new mode.

The new connlimit expression sets on the NFT_EXPR_GC flag to enable this
behaviour, the dynset expression needs to explicitly enable the garbage
collector via set->ops->gc_init call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3453c927 02-Jun-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: pass ctx to nf_tables_expr_destroy()

nft_set_elem_destroy() can be called from call_rcu context. Annotate
netns and table in set object so we can populate the context object.
Moreover, pass context object to nf_tables_set_elem_destroy() from the
commit phase, since it is already available from there.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 00bfb320 02-Jun-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: pass context to object destroy indirection

The new connlimit object needs this to properly deal with conntrack
dependencies.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a654de8f 30-May-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: fix chain dependency validation

The following ruleset:

add table ip filter
add chain ip filter input { type filter hook input priority 4; }
add chain ip filter ap
add rule ip filter input jump ap
add rule ip filter ap masquerade

results in a panic, because the masquerade extension should be rejected
from the filter chain. The existing validation is missing a chain
dependency check when the rule is added to the non-base chain.

This patch fixes the problem by walking down the rules from the
basechains, searching for either immediate or lookup expressions, then
jumping to non-base chains and again walking down the rules to perform
the expression validation, so we make sure the full ruleset graph is
validated. This is done only once from the commit phase, in case of
problem, we abort the transaction and perform fine grain validation for
error reporting. This patch requires 003087911af2 ("netfilter:
nfnetlink: allow commit to fail") to achieve this behaviour.

This patch also adds a cleanup callback to nfnl batch interface to reset
the validate state from the exit path.

As a result of this patch, nf_tables_check_loops() doesn't use
->validate to check for loops, instead it just checks for immediate
expressions.

Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0cbc06b3 24-May-2018 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: remove synchronize_rcu in commit phase

synchronize_rcu() is expensive.

The commit phase currently enforces an unconditional
synchronize_rcu() after incrementing the generation counter.

This is to make sure that a packet always sees a consistent chain, either
nft_do_chain is still using old generation (it will skip the newly added
rules), or the new one (it will skip old ones that might still be linked
into the list).

We could just remove the synchronize_rcu(), it would not cause a crash but
it could cause us to evaluate a rule that was removed and new rule for the
same packet, instead of either-or.

To resolve this, add rule pointer array holding two generations, the
current one and the future generation.

In commit phase, allocate the rule blob and populate it with the rules that
will be active in the new generation.

Then, make this rule blob public, replacing the old generation pointer.

Then the generation counter can be incremented.

nft_do_chain() will either continue to use the current generation
(in case loop was invoked right before increment), or the new one.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4e25ceb8 14-May-2018 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: allow chain type to override hook register

Will be used in followup patch when nat types no longer
use nf_register_net_hook() but will instead register with the nat core.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# bb7b40ae 07-May-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: bogus EBUSY in chain deletions

When removing a rule that jumps to chain and such chain in the same
batch, this bogusly hits EBUSY. Add activate and deactivate operations
to expression that can be called from the preparation and the
commit/abort phases.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8e1102d5 16-Apr-2018 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: support timeouts larger than 23 days

Marco De Benedetto says:
I would like to use a timeout of 30 days for elements in a set but it
seems there is a some kind of problem above 24d20h31m23s.

Fix this by using 'jiffies64' for timeout handling to get same behaviour
on 32 and 64bit systems.

nftables passes timeouts as u64 in milliseconds to the kernel,
but on kernel side we used a mixture of 'long' and jiffies conversions
rather than u64 and jiffies64.

Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1237
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 71cc0873 03-Apr-2018 Phil Sutter <phil@nwl.cc>

netfilter: nf_tables: Simplify set backend selection

Drop nft_set_type's ability to act as a container of multiple backend
implementations it chooses from. Instead consolidate the whole selection
logic in nft_select_set_ops() and the actual backend provided estimate()
callback.

This turns nf_tables_set_types into a list containing all available
backends which is traversed when selecting one matching userspace
requested criteria.

Also, this change allows to embed nft_set_ops structure into
nft_set_type and pull flags field into the latter as it's only used
during selection phase.

A crucial part of this change is to make sure the new layout respects
hash backend constraints formerly enforced by nft_hash_select_ops()
function: This is achieved by introduction of a specific estimate()
callback for nft_hash_fast_ops which returns false for key lengths != 4.
In turn, nft_hash_estimate() is changed to return false for key lengths
== 4 so it won't be chosen by accident. Also, both callbacks must return
false for unbounded sets as their size estimate depends on a known
maximum element count.

Note that this patch partially reverts commit 4f2921ca21b71 ("netfilter:
nf_tables: meter: pick a set backend that supports updates") by making
nft_set_ops_candidate() not explicitly look for an update callback but
make NFT_SET_EVAL a regular backend feature flag which is checked along
with the others. This way all feature requirements are checked in one
go.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cac20fcd 27-Mar-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: simplify lookup functions

Replace the nf_tables_ prefix by nft_ and merge code into single lookup
function whenever possible. In many cases we go over the 80-chars
boundary function names, this save us ~50 LoC.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 84453a90 26-Feb-2018 Felix Fietkau <nbd@nbd.name>

netfilter: nf_flow_table: track flow tables in nf_flow_table directly

Avoids having nf_flow_table depend on nftables (useful for future
iptables backport work)

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 10659cba 27-Mar-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: rename to nft_set_lookup_global()

To prepare shorter introduction of shorter function prefix.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 43a605f2 27-Mar-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: enable conntrack if NAT chain is registered

Register conntrack hooks if the user adds NAT chains. Users get confused
with the existing behaviour since they will see no packets hitting this
chain until they add the first rule that refers to conntrack.

This patch adds new ->init() and ->free() indirections to chain types
that can be used by NAT chains to invoke the conntrack dependency.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 02c7b25e 27-Mar-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: build-in filter chain type

One module per supported filter chain family type takes too much memory
for very little code - too much modularization - place all chain filter
definitions in one single file.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cc07eeb0 27-Mar-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: nft_register_chain_type() returns void

Use WARN_ON() instead since it should not happen that neither family
goes over NFPROTO_NUMPROTO nor there is already a chain of this type
already registered.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 32537e91 27-Mar-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: rename struct nf_chain_type

Use nft_ prefix. By when I added chain types, I forgot to use the
nftables prefix. Rename enum nft_chain_type to enum nft_chain_types too,
otherwise there is an overlap.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d92191aa 21-Mar-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: cache device name in flowtable object

Devices going away have to grab the nfnl_lock from the netdev event path
to avoid races with control plane updates.

However, netlink dumps in netfilter do not hold nfnl_lock mutex. Cache
the device name into the objects to avoid an use-after-free situation
for a device that is going away.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3ecbfd65 26-Dec-2017 Harsha Sharma <harshasharmaiitr@gmail.com>

netfilter: nf_tables: allocate handle and delete objects via handle

This patch allows deletion of objects via unique handle which can be
listed via '-a' option.

Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 98319cb9 08-Jan-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: get rid of struct nft_af_info abstraction

Remove the infrastructure to register/unregister nft_af_info structure,
this structure stores no useful information anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# dd4cbef7 08-Jan-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: get rid of pernet families

Now that we have a single table list for each netns, we can get rid of
one pointer per family and the global afinfo list, thus, shrinking
struct netns for nftables that now becomes 64 bytes smaller.

And call __nft_release_afinfo() from __net_exit path accordingly to
release netnamespace objects on removal.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 36596dad 08-Jan-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add single table list for all families

Place all existing user defined tables in struct net *, instead of
having one list per family. This saves us from one level of indentation
in netlink dump functions.

Place pointer to struct nft_af_info in struct nft_table temporarily, as
we still need this to put back reference module reference counter on
table removal.

This patch comes in preparation for the removal of struct nft_af_info.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e7bb5c71 19-Dec-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove flag field from struct nft_af_info

Replace it by a direct check for the netdev protocol family.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# fe19c04c 19-Dec-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove nhooks field from struct nft_af_info

We already validate the hook through bitmask, so this check is
superfluous. When removing this, this patch is also fixing a bug in the
new flowtable codebase, since ctx->afi points to the table family
instead of the netdev family which is where the flowtable is really
hooked in.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3b49e2e9 06-Jan-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add flow table netlink frontend

This patch introduces a netlink control plane to create, delete and dump
flow tables. Flow tables are identified by name, this name is used from
rules to refer to an specific flow table. Flow tables use the rhashtable
class and a generic garbage collector to remove expired entries.

This also adds the infrastructure to add different flow table types, so
we can add one for each layer 3 protocol family.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0befd061 01-Jan-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove nft_dereference()

This macro is unnecessary, it just hides details for one single caller.
nfnl_dereference() is just enough.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c2f9eafe 09-Dec-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove hooks from family definition

They don't belong to the family definition, move them to the filter
chain type definition instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c974a3a3 09-Dec-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove multihook chains and families

Since NFPROTO_INET is handled from the core, we don't need to maintain
extra infrastructure in nf_tables to handle the double hook
registration, one for IPv4 and another for IPv6.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 408070d6 24-Nov-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add nft_set_is_anonymous() helper

Add helper function to test for the NFT_SET_ANONYMOUS flag.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7a4473a3 09-Dec-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: explicit nft_set_pktinfo() call from hook path

Instead of calling this function from the family specific variant, this
reduces the code size in the fast path for the netdev, bridge and inet
families. After this change, we must call nft_set_pktinfo() upfront from
the chain hook indirection.

Before:

text data bss dec hex filename
2145 208 0 2353 931 net/netfilter/nf_tables_netdev.o

After:

text data bss dec hex filename
2125 208 0 2333 91d net/netfilter/nf_tables_netdev.o

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ba0e4d99 09-Oct-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: get set elements via netlink

This patch adds a new get operation to look up for specific elements in
a set via netlink interface. You can also use it to check if an interval
already exists.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 14cd5d4a 23-Oct-2017 Mark Rutland <mark.rutland@arm.com>

locking/atomics, net/netlink/netfilter: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.

However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.

It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts netlink and netfilter code and comments to use
{READ,WRITE}_ONCE() consistently.

----
virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-7-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# dfc46034 23-Aug-2017 Pablo M. Bermudo Garay <pablombg@gmail.com>

netfilter: nf_tables: add select_ops for stateful objects

This patch adds support for overloading stateful objects operations
through the select_ops() callback, just as it is implemented for
expressions.

This change is needed for upcoming additions to the stateful objects
infrastructure.

Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 61509575 27-Jul-2017 Phil Sutter <phil@nwl.cc>

netfilter: nf_tables: Allow object names of up to 255 chars

Same conversion as for table names, use NFT_NAME_MAXLEN as upper
boundary as well.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 38745490 27-Jul-2017 Phil Sutter <phil@nwl.cc>

netfilter: nf_tables: Allow set names of up to 255 chars

Same conversion as for table names, use NFT_NAME_MAXLEN as upper
boundary as well.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b7263e07 27-Jul-2017 Phil Sutter <phil@nwl.cc>

netfilter: nf_tables: Allow chain name of up to 255 chars

Same conversion as for table names, use NFT_NAME_MAXLEN as upper
boundary as well.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e46abbcc 27-Jul-2017 Phil Sutter <phil@nwl.cc>

netfilter: nf_tables: Allow table names of up to 255 chars

Allocate all table names dynamically to allow for arbitrary lengths but
introduce NFT_NAME_MAXLEN as an upper sanity boundary. It's value was
chosen to allow using a domain name as per RFC 1035.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 347b408d 22-May-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: pass set description to ->privsize

The new non-resizable hashtable variant needs this to calculate the
size of the bucket array.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2b664957 22-May-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: select set backend flavour depending on description

This patch adds the infrastructure to support several implementations of
the same set type. This selection will be based on the set description
and the features available for this set. This allow us to select set
backend implementation that will result in better performance numbers.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 59105446 15-May-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: revisit chain/object refcounting from elements

Andreas reports that the following incremental update using our commit
protocol doesn't work.

# nft -f incremental-update.nft
delete element ip filter client_to_any { 10.180.86.22 : goto CIn_1 }
delete chain ip filter CIn_1
... Error: Could not process rule: Device or resource busy

The existing code is not well-integrated into the commit phase protocol,
since element deletions do not result in refcount decrement from the
preparation phase. This results in bogus EBUSY errors like the one
above.

Two new functions come with this patch:

* nft_set_elem_activate() function is used from the abort path, to
restore the set element refcounting on objects that occurred from
the preparation phase.

* nft_set_elem_deactivate() that is called from nft_del_setelem() to
decrement set element refcounting on objects from the preparation
phase in the commit protocol.

The nft_data_uninit() has been renamed to nft_data_release() since this
function does not uninitialize any data store in the data register,
instead just releases the references to objects. Moreover, a new
function nft_data_hold() has been introduced to be used from
nft_set_elem_activate().

Reported-by: Andreas Schultz <aschultz@tpip.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f323d954 20-Mar-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add nft_is_base_chain() helper

This new helper function allows us to check if this is a basechain.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 04166f48 13-Mar-2017 Pablo Neira Ayuso <pablo@netfilter.org>

Revert "netfilter: nf_tables: add flush field to struct nft_set_iter"

This reverts commit 1f48ff6c5393aa7fe290faf5d633164f105b0aa7.

This patch is not required anymore now that we keep a dummy list of
set elements in the bitmap set implementation, so revert this before
we forget this code has no clients.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 84fba055 08-Mar-2017 Florian Westphal <fw@strlen.de>

netfilter: provide nft_ctx in object init function

this is needed by the upcoming ct helper object type --
we'd like to be able use the table family (ip, ip6, inet) to figure
out which helper has to be requested.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 10596608 08-Mar-2017 Liping Zhang <zlpnobody@gmail.com>

netfilter: nf_tables: fix mismatch in big-endian system

Currently, there are two different methods to store an u16 integer to
the u32 data register. For example:
u32 *dest = &regs->data[priv->dreg];
1. *dest = 0; *(u16 *) dest = val_u16;
2. *dest = val_u16;

For method 1, the u16 value will be stored like this, either in
big-endian or little-endian system:
0 15 31
+-+-+-+-+-+-+-+-+-+-+-+-+
| Value | 0 |
+-+-+-+-+-+-+-+-+-+-+-+-+

For method 2, in little-endian system, the u16 value will be the same
as listed above. But in big-endian system, the u16 value will be stored
like this:
0 15 31
+-+-+-+-+-+-+-+-+-+-+-+-+
| 0 | Value |
+-+-+-+-+-+-+-+-+-+-+-+-+

So later we use "memcmp(&regs->data[priv->sreg], data, 2);" to do
compare in nft_cmp, nft_lookup expr ..., method 2 will get the wrong
result in big-endian system, as 0~15 bits will always be zero.

For the similar reason, when loading an u16 value from the u32 data
register, we should use "*(u16 *) sreg;" instead of "(u16)*sreg;",
the 2nd method will get the wrong value in the big-endian system.

So introduce some wrapper functions to store/load an u8 or u16
integer to/from the u32 data register, and use them in the right
place.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c7a72e3f 06-Mar-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add nft_set_lookup()

This new function consolidates set lookup via either name or ID by
introducing a new nft_set_lookup() function. Replace existing spots
where we can use this too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 25e94a99 28-Feb-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails

The underlying nlmsg_multicast() already sets sk->sk_err for us to
notify socket overruns, so we should not do anything with this return
value. So we just call nfnetlink_set_err() if:

1) We fail to allocate the netlink message.

or

2) We don't have enough space in the netlink message to place attributes,
which means that we likely need to allocate a larger message.

Before this patch, the internal ESRCH netlink error code was propagated
to userspace, which is quite misleading. Netlink semantics mandate that
listeners just hit ENOBUFS if the socket buffer overruns.

Reported-by: Alexander Alemayhu <alexander@alemayhu.com>
Tested-by: Alexander Alemayhu <alexander@alemayhu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1a94e38d 09-Feb-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add NFTA_RULE_ID attribute

This new attribute allows us to uniquely identify a rule in transaction.
Robots may trigger an insertion followed by deletion in a batch, in that
scenario we still don't have a public rule handle that we can use to
delete the rule. This is similar to the NFTA_SET_ID attribute that
allows us to refer to an anonymous set from a batch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0b5a7874 18-Jan-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add space notation to sets

The space notation allows us to classify the set backend implementation
based on the amount of required memory. This provides an order of the
set representation scalability in terms of memory. The size field is
still left in place so use this if the userspace provides no explicit
number of elements, so we cannot calculate the real memory that this set
needs. This also helps us break ties in the set backend selection
routine, eg. two backend implementations provide the same performance.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 55af753c 18-Jan-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: rename struct nft_set_estimate class field

Use lookup as field name instead, to prepare the introduction of the
memory class in a follow up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1f48ff6c 18-Jan-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add flush field to struct nft_set_iter

This provides context to walk callback iterator, thus, we know if the
walk happens from the set flush path. This is required by the new bitmap
set type coming in a follow up patch which has no real struct
nft_set_ext, so it has to allocate it based on the two bit compact
element representation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1ba1c414 18-Jan-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: rename deactivate_one() to flush()

Although semantics are similar to deactivate() with no implicit element
lookup, this is only called from the set flush path, so better rename
this to flush().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5cb82a38 18-Jan-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: pass netns to set->ops->remove()

This new parameter is required by the new bitmap set type that comes in a
follow up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# de70185d 23-Jan-2017 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: deconstify walk callback function

The flush operation needs to modify set and element objects, so let's
deconstify this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8411b644 05-Dec-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: support for set flushing

This patch adds support for set flushing, that consists of walking over
the set elements if the NFTA_SET_ELEM_LIST_ELEMENTS attribute is set.
This patch requires the following changes:

1) Add set->ops->deactivate_one() operation: This allows us to
deactivate an element from the set element walk path, given we can
skip the lookup that happens in ->deactivate().

2) Add a new nft_trans_alloc_gfp() function since we need to allocate
transactions using GFP_ATOMIC given the set walk path happens with
held rcu_read_lock.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8aeff920 27-Nov-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add stateful object reference to set elements

This patch allows you to refer to stateful objects from set elements.
This provides the infrastructure to create maps where the right hand
side of the mapping is a stateful object.

This allows us to build dictionaries of stateful objects, that you can
use to perform fast lookups using any arbitrary key combination.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 18965317 27-Nov-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_quota: add depleted flag for objects

Notify on depleted quota objects. The NFT_QUOTA_F_DEPLETED flag
indicates we have reached overquota.

Add pointer to table from nft_object, so we can use it when sending the
depletion notification to userspace.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2599e989 27-Nov-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: notify internal updates of stateful objects

Introduce nf_tables_obj_notify() to notify internal state changes in
stateful objects. This is used by the quota object to report depletion
in a follow up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 43da04a5 27-Nov-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: atomic dump and reset for stateful objects

This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic
dump-and-reset of the stateful object. This also comes with add support
for atomic dump and reset for counter and quota objects.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e5009240 27-Nov-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add stateful objects

This patch augments nf_tables to support stateful objects. This new
infrastructure allows you to create, dump and delete stateful objects,
that are identified by a user-defined name.

This patch adds the generic infrastructure, follow up patches add
support for two stateful objects: counters and quotas.

This patch provides a native infrastructure for nf_tables to replace
nfacct, the extended accounting infrastructure for iptables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d3e2a111 20-Nov-2016 Anders K. Pedersen <akp@cohaesio.com>

netfilter: nf_tables: fix inconsistent element expiration calculation

As Liping Zhang reports, after commit a8b1e36d0d1d ("netfilter: nft_dynset:
fix element timeout for HZ != 1000"), priv->timeout was stored in jiffies,
while set->timeout was stored in milliseconds. This is inconsistent and
incorrect.

Firstly, we already call msecs_to_jiffies in nft_set_elem_init, so
priv->timeout will be converted to jiffies twice.

Secondly, if the user did not specify the NFTA_DYNSET_TIMEOUT attr,
set->timeout will be used, but we forget to call msecs_to_jiffies
when do update elements.

Fix this by using jiffies internally for traditional sets and doing the
conversions to/from msec when interacting with userspace - as dynset
already does.

This is preferable to doing the conversions, when elements are inserted or
updated, because this can happen very frequently on busy dynsets.

Fixes: a8b1e36d0d1d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Reported-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Acked-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0e5a1c7e 03-Nov-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use hook state from xt_action_param structure

Don't copy relevant fields from hook state structure, instead use the
one that is already available in struct xt_action_param.

This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 613dbd95 03-Nov-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: x_tables: move hook state into xt_action_param structure

Place pointer to hook state in xt_action_param structure instead of
copying the fields that we need. After this change xt_action_param fits
into one cacheline.

This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f1d505bb 25-Oct-2016 John W. Linville <linville@tuxdriver.com>

netfilter: nf_tables: fix type mismatch with error return from nft_parse_u32_check

Commit 36b701fae12ac ("netfilter: nf_tables: validate maximum value of
u32 netlink attributes") introduced nft_parse_u32_check with a return
value of "unsigned int", yet on error it returns "-ERANGE".

This patch corrects the mismatch by changing the return value to "int",
which happens to match the actual users of nft_parse_u32_check already.

Found by Coverity, CID 1373930.

Note that commit 21a9e0f1568ea ("netfilter: nft_exthdr: fix error
handling in nft_exthdr_init()) attempted to address the issue, but
did not address the return type of nft_parse_u32_check.

Signed-off-by: John W. Linville <linville@tuxdriver.com>
Cc: Laura Garcia Liebana <nevola@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 36b701fae12ac ("netfilter: nf_tables: validate maximum value...")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 61f9e292 22-Oct-2016 Liping Zhang <zlpnobody@gmail.com>

netfilter: nf_tables: fix *leak* when expr clone fail

When nft_expr_clone failed, a series of problems will happen:

1. module refcnt will leak, we call __module_get at the beginning but
we forget to put it back if ops->clone returns fail
2. memory will be leaked, if clone fail, we just return NULL and forget
to free the alloced element
3. set->nelems will become incorrect when set->size is specified. If
clone fail, we should decrease the set->nelems

Now this patch fixes these problems. And fortunately, clone fail will
only happen on counter expression when memory is exhausted.

Fixes: 086f332167d6 ("netfilter: nf_tables: add clone interface to expression operations")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 36b701fa 14-Sep-2016 Laura Garcia Liebana <nevola@gmail.com>

netfilter: nf_tables: validate maximum value of u32 netlink attributes

Fetch value and validate u32 netlink attribute. This validation is
usually required when the u32 netlink attributes are being stored in a
field whose size is smaller.

This patch revisits 4da449ae1df9 ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").

Fixes: 96518518cc41 ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# beac5afa 08-Sep-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: ensure proper initialization of nft_pktinfo fields

This patch introduces nft_set_pktinfo_unspec() that ensures proper
initialization all of pktinfo fields for non-IP traffic. This is used
by the bridge, netdev and arp families.

This new function relies on nft_set_pktinfo_proto_unspec() to set a new
tprot_set field that indicates if transport protocol information is
available. Remain fields are zeroed.

The meta expression has been also updated to check to tprot_set in first
place given that zero is a valid tprot value. Even a handcrafted packet
may come with the IPPROTO_RAW (255) protocol number so we can't rely on
this value as tprot unset.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c016c7e4 23-Aug-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion

If the NLM_F_EXCL flag is set, then new elements that clash with an
existing one return EEXIST. In case you try to add an element whose
data area differs from what we have, then this returns EBUSY. If no
flag is specified at all, then this returns success to userspace.

This patch also update the set insert operation so we can fetch the
existing element that clashes with the one you want to add, we need
this to make sure the element data doesn't differ.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 42a55769 08-Jul-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: get rid of possible_net_t from set and basechain

We can pass the netns pointer as parameter to the functions that need to
gain access to it. From basechains, I didn't find any client for this
field anymore so let's remove this too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 82bec71d 22-Jun-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: get rid of NFT_BASECHAIN_DISABLED

This flag was introduced to restore rulesets from the new netdev
family, but since 5ebe0b0eec9d6f7 ("netfilter: nf_tables: destroy
basechain and rules on netdevice removal") the ruleset is released
once the netdev is gone.

This also removes nft_register_basechain() and
nft_unregister_basechain() since they have no clients anymore after
this rework.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 37a9cc52 12-Jun-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add generation mask to sets

Similar to ("netfilter: nf_tables: add generation mask to tables").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 664b0f8c 12-Jun-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add generation mask to chains

Similar to ("netfilter: nf_tables: add generation mask to tables").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f2a6d766 14-Jun-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add generation mask to tables

This patch addresses two problems:

1) The netlink dump is inconsistent when interfering with an ongoing
transaction update for several reasons:

1.a) We don't honor the internal NFT_TABLE_INACTIVE flag, and we should
be skipping these inactive objects in the dump.

1.b) We perform speculative deletion during the preparation phase, that
may result in skipping active objects.

1.c) The listing order changes, which generates noise when tracking
incremental ruleset update via tools like git or our own
testsuite.

2) We don't allow to add and to update the object in the same batch,
eg. add table x; add table x { flags dormant\; }.

In order to resolve these problems:

1) If the user requests a deletion, the object becomes inactive in the
next generation. Then, ignore objects that scheduled to be deleted
from the lookup path, as they will be effectively removed in the
next generation.

2) From the get/dump path, if the object is not currently active, we
skip it.

3) Support 'add X -> update X' sequence from a transaction.

After this update, we obtain a consistent list as long as we stay
in the same generation. The userspace side can detect interferences
through the generation counter so it can restart the dumping.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 889f7ee7 12-Jun-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add generic macros to check for generation mask

Thus, we can reuse these to check the genmask of any object type, not
only rules. This is required now that tables, chain and sets will get a
generation mask field too in follow up patches.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8588ac09 10-Jun-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: reject loops from set element jump to chain

Liping Zhang says:

"Users may add such a wrong nft rules successfully, which will cause an
endless jump loop:

# nft add rule filter test tcp dport vmap {1: jump test}

This is because before we commit, the element in the current anonymous
set is inactive, so osp->walk will skip this element and miss the
validate check."

To resolve this problem, this patch passes the generation mask to the
walk function through the iter container structure depending on the code
path:

1) If we're dumping the elements, then we have to check if the element
is active in the current generation. Thus, we check for the current
bit in the genmask.

2) If we're checking for loops, then we have to check if the element is
active in the next generation, as we're in the middle of a
transaction. Thus, we check for the next bit in the genmask.

Based on original patch from Liping Zhang.

Reported-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Liping Zhang <liping.zhang@spreadtrum.com>


# cb39ad8b 04-May-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: allow set names up to 32 bytes

Currently, we support set names of up to 16 bytes, get this aligned
with the maximum length we can use in ipset to make it easier when
considering migration to nf_tables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e6d8ecac 05-Jan-2016 Carlos Falgueras García <carlosfg@riseup.net>

netfilter: nf_tables: Add new attributes into nft_set to store user data.

User data is stored at after 'nft_set_ops' private data into 'data[]'
flexible array. The field 'udata' points to user data and 'udlen' stores
its length.

Add new flag NFTA_SET_USERDATA.

Signed-off-by: Carlos Falgueras García <carlosfg@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5ebe0b0e 15-Dec-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: destroy basechain and rules on netdevice removal

If the netdevice is destroyed, the resources that are attached should
be released too as they belong to the device that is now gone.

Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# df05ef87 15-Dec-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: release objects on netns destruction

We have to release the existing objects on netns removal otherwise we
leak them. Chains are unregistered in first place to make sure no
packets are walking on our rules and sets anymore.

The object release happens by when we unregister the family via
nft_release_afinfo() which is called from nft_unregister_afinfo() from
the corresponding __net_exit path in every family.

Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 33d5a7b1 28-Nov-2015 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: extend tracing infrastructure

nft monitor mode can then decode and display this trace data.

Parts of LL/Network/Transport headers are provided as separate
attributes.

Otherwise, printing IP address data becomes virtually impossible
for userspace since in the case of the netdev family we really don't
want userspace to have to know all the possible link layer types
and/or sizes just to display/print an ip address.

We also don't want userspace to have to follow ipv6 header chains
to get the s/dport info, the kernel already did this work for us.

To avoid bloating nft_do_chain all data required for tracing is
encapsulated in nft_traceinfo.

The structure is initialized unconditionally(!) for each nft_do_chain
invocation.

This unconditionall call will be moved under a static key in a
followup patch.

With lots of help from Patrick McHardy and Pablo Neira.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a9ecfbe7 23-Nov-2015 Florian Westphal <fw@strlen.de>

netfilter: nf_tables: remove unused struct members

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 086f3321 10-Nov-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add clone interface to expression operations

With the conversion of the counter expressions to make it percpu, we
need to clone the percpu memory area, otherwise we crash when using
counters from flow tables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 06198b34 18-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: Pass priv instead of nf_hook_ops to netfilter hooks

Only pass the void *priv parameter out of the nf_hook_ops. That is
all any of the functions are interested now, and by limiting what is
passed it becomes simpler to change implementation details.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 46448d00 18-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: nf_tables: Pass struct net in nft_pktinfo

nft_pktinfo is passed on the stack so this does not bloat any in core
data structures.

By centrally computing this information this makes maintence of the code
simpler, and understading of the code easier.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 156c196f 18-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: x_tables: Pass struct net in xt_action_param

As xt_action_param lives on the stack this does not bloat any
persistent data structures.

This is a first step in making netfilter code that needs to know
which network namespace it is executing in simpler.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6aa187f2 18-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: nf_tables: kill nft_pktinfo.ops

- Add nft_pktinfo.pf to replace ops->pf
- Add nft_pktinfo.hook to replace ops->hooknum

This simplifies the code, makes it more readable, and likely reduces
cache line misses. Maintainability is enhanced as the details of
nft_hook_ops are of no concern to the recpients of nft_pktinfo.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# bf798657 12-Aug-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: Use 32 bit addressing register from nft_type_to_reg()

nft_type_to_reg() needs to return the register in the new 32 bit addressing,
otherwise we hit EINVAL when using mappings.

Fixes: 49499c3 ("netfilter: nf_tables: switch registers to 32 bit addressing")
Reported-by: Andreas Schultz <aschultz@tpip.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 835b8033 14-Jun-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables_netdev: unregister hooks on net_device removal

In case the net_device is gone, we have to unregister the hooks and put back
the reference on the net_device object. Once it comes back, register them
again. This also covers the device rename case.

This patch also adds a new flag to indicate that the basechain is disabled, so
their hooks are not registered. This flag is used by the netdev family to
handle the case where the net_device object is gone. Currently this flag is not
exposed to userspace.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2cbce139 12-Jun-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: attach net_device to basechain

The device is part of the hook configuration, so instead of a global
configuration per table, set it to each of the basechain that we create.

This patch reworks ebddf1a8d78a ("netfilter: nf_tables: allow to bind table to
net_device").

Note that this adds a dev_name field in the nft_base_chain structure which is
required the netdev notification subscription that follows up in a patch to
handle gone net_devices.

Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ebddf1a8 26-May-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: allow to bind table to net_device

This patch adds the internal NFT_AF_NEEDS_DEV flag to indicate that you must
attach this table to a net_device.

This change is required by the follow up patch that introduces the new netdev
table.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 151d799a 11-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: mark stateful expressions

Add a flag to mark stateful expressions.

This is used for dynamic expression instanstiation to limit the usable
expressions. Strictly speaking only the dynset expression can not be
used in order to avoid recursion, but since dynamically instantiating
non-stateful expressions will simply create an identical copy, which
behaves no differently than the original, this limits to expressions
where it actually makes sense to dynamically instantiate them.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f25ad2e9 11-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: prepare for expressions associated to set elements

Preparation to attach expressions to set elements: add a set extension
type to hold an expression and dump the expression information with the
set element.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0b2d8a7b 11-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add helper functions for expression handling

Add helper functions for initializing, cloning, dumping and destroying
a single expression that is not part of a rule.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7d740264 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: variable sized set element keys / data

This patch changes sets to support variable sized set element keys / data
up to 64 bytes each by using variable sized set extensions. This allows
to use concatenations with bigger data items suchs as IPv6 addresses.

As a side effect, small keys/data now don't require the full 16 bytes
of struct nft_data anymore but just the space they need.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d0a11fc3 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: support variable sized data in nft_data_init()

Add a size argument to nft_data_init() and pass in the available space.
This will be used by the following patches to support variable sized
set element data.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 49499c3e 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: switch registers to 32 bit addressing

Switch the nf_tables registers from 128 bit addressing to 32 bit
addressing to support so called concatenations, where multiple values
can be concatenated over multiple registers for O(1) exact matches of
multiple dimensions using sets.

The old register values are mapped to areas of 128 bits for compatibility.
When dumping register numbers, values are expressed using the old values
if they refer to the beginning of a 128 bit area for compatibility.

To support concatenations, register loads of less than a full 32 bit
value need to be padded. This mainly affects the payload and exthdr
expressions, which both unconditionally zero the last word before
copying the data.

Userspace fully passes the testsuite using both old and new register
addressing.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b1c96ed3 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add register parsing/dumping helpers

Add helper functions to parse and dump register values in netlink attributes.
These helpers will later be changed to take care of translation between the
old 128 bit and the new 32 bit register numbers.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8cd8937a 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: convert sets to u32 data pointers

Simple conversion to use u32 pointers to the beginning of the data
area to keep follow up patches smaller.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e562d860 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: kill nft_data_cmp()

Only needlessly complicates things due to requiring specific argument
types. Use memcmp directly.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1ca2e170 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: use struct nft_verdict within struct nft_data

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a55e22e9 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: get rid of NFT_REG_VERDICT usage

Replace the array of registers passed to expressions by a struct nft_regs,
containing the verdict as a seperate member, which aliases to the
NFT_REG_VERDICT register.

This is needed to seperate the verdict from the data registers completely,
so their size can be changed.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d07db988 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: introduce nft_validate_register_load()

Change nft_validate_input_register() to not only validate the input
register number, but also the length of the load, and rename it to
nft_validate_register_load() to reflect that change.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 27e6d201 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: kill nft_validate_output_register()

All users of nft_validate_register_store() first invoke
nft_validate_output_register(). There is in fact no use for using it
on its own, so simplify the code by folding the functionality into
nft_validate_register_store() and kill it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1ec10212 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: rename nft_validate_data_load()

The existing name is ambiguous, data is loaded as well when we read from
a register. Rename to nft_validate_register_store() for clarity and
consistency with the upcoming patch to introduce its counterpart,
nft_validate_register_load().

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 45d9bcda 10-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: validate len in nft_validate_data_load()

For values spanning multiple registers, we need to validate that enough
space is available from the destination register onwards. Add a len
argument to nft_validate_data_load() and consolidate the existing length
validations in preparation of that.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 68e942e8 05-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: support optional userdata for set elements

Add an userdata set extension and allow the user to attach arbitrary
data to set elements. This is intended to hold TLV encoded data like
comments or DNS annotations that have no meaning to the kernel.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 22fe54d5 05-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add support for dynamic set updates

Add a new "dynset" expression for dynamic set updates.

A new set op ->update() is added which, for non existant elements,
invokes an initialization callback and inserts the new element.
For both new or existing elements the extenstion pointer is returned
to the caller to optionally perform timer updates or other actions.

Element removal is not supported so far, however that seems to be a
rather exotic need and can be added later on.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 11113e19 05-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: support different set binding types

Currently a set binding is assumed to be related to a lookup and, in
case of maps, a data load.

In order to use bindings for set updates, the loop detection checks
must be restricted to map operations only. Add a flags member to the
binding struct to hold the set "action" flags such as NFT_SET_MAP,
and perform loop detection based on these.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3dd0673a 05-Apr-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: prepare set element accounting for async updates

Use atomic operations for the element count to avoid races with async
updates.

To properly handle the transactional semantics during netlink updates,
deleted but not yet committed elements are accounted for seperately and
are treated as being already removed. This means for the duration of
a netlink transaction, the limit might be exceeded by the amount of
elements deleted. Set implementations must be prepared to handle this.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 073bfd56 03-Apr-2015 David S. Miller <davem@davemloft.net>

netfilter: Pass nf_hook_state through nft_set_pktinfo*().

Signed-off-by: David S. Miller <davem@davemloft.net>


# 9d098292 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nft_hash: add support for timeouts

Add support for element timeouts to nft_hash. The lookup and walking
functions are changed to ignore timed out elements, a periodic garbage
collection task cleans out expired entries.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 69086658 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add GC synchronization helpers

GC is expected to happen asynchrously to the netlink interface. In the
netlink path, both insertion and removal of elements consist of two
steps, insertion followed by activation or deactivation followed by
removal, during which the element must not be freed by GC.

The synchronization helpers use an unused bit in the genmask field to
atomically mark an element as "busy", meaning it is either currently
being handled through the netlink API or by GC.

Elements being processed by GC will never survive, netlink will simply
ignore them. Elements being currently processed through netlink will be
skipped by GC and reprocessed during the next run.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cfed7e1b 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add set garbage collection helpers

Add helpers for GC batch destruction: since element destruction needs
a RCU grace period for all set implementations, add some helper functions
for asynchronous batch destruction. Elements are collected in a batch
structure, which is asynchronously released using RCU once its full.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c3e1b005 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add set element timeout support

Add API support for set element timeouts. Elements can have a individual
timeout value specified, overriding the sets' default.

Two new extension types are used for timeouts - the timeout value and
the expiration time. The timeout value only exists if it differs from
the default value.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 761da293 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add set timeout API support

Add set timeout support to the netlink API. Sets with timeout support
enabled can have a default timeout value and garbage collection interval
specified.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cc02e457 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: implement set transaction support

Set elements are the last object type not supporting transaction support.
Implement similar to the existing rule transactions:

The global transaction counter keeps track of two generations, current
and next. Each element contains a bitmask specifying in which generations
it is inactive.

New elements start out as inactive in the current generation and active
in the next. On commit, the previous next generation becomes the current
generation and the element becomes active. The bitmask is then cleared
to indicate that the element is active in all future generations. If the
transaction is aborted, the element is removed from the set before it
becomes active.

When removing an element, it gets marked as inactive in the next generation.
On commit the next generation becomes active and the therefor the element
inactive. It is then taken out of then set and released. On abort, the
element is marked as active for the next generation again.

Lookups ignore elements not active in the current generation.

The current set types (hash/rbtree) both use a field in the extension area
to store the generation mask. This (currently) does not require any
additional memory since we have some free space in there.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ea4bd995 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add transaction helper functions

Add some helper functions for building the genmask as preparation for
set transactions.

Also add a little documentation how this stuff actually works.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b2832dd6 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: return set extensions from ->lookup()

Return the extension area from the ->lookup() function to allow to
consolidate common actions.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 61edafbb 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: consolide set element destruction

With the conversion to set extensions, it is now possible to consolidate
the different set element destruction functions.

The set implementations' ->remove() functions are changed to only take
the element out of their internal data structures. Elements will be freed
in a batched fashion after the global transaction's completion RCU grace
period.

This reduces the amount of grace periods required for nft_hash from N
to zero additional ones, additionally this guarantees that the set
elements' extensions of all implementations can be used under RCU
protection.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# fe2811eb 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: convert hash and rbtree to set extensions

The set implementations' private struct will only contain the elements
needed to maintain the search structure, all other elements are moved
to the set extensions.

Element allocation and initialization is performed centrally by
nf_tables_api instead of by the different set implementations'
->insert() functions. A new "elemsize" member in the set ops specifies
the amount of memory to reserve for internal usage. Destruction
will also be moved out of the set implementations by a following patch.

Except for element allocation, the patch is a simple conversion to
using data from the extension area.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3ac4c07a 25-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add set extensions

Add simple set extension infrastructure for maintaining variable sized
and optional per element data.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5ebb335d 21-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: move struct net pointer to base chain

The network namespace is only needed for base chains to get at the
gencursor. Also convert to possible_net_t.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1cae565e 05-Mar-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: limit maximum table name length to 32 bytes

Set the same as we use for chain names, it should be enough.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1a1e1a12 03-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: cleanup nf_tables.h

The transaction related definitions are squeezed in between the rule
and expression definitions, which are closely related and should be
next to each other. The transaction definitions actually don't belong
into that file at all since it defines the global objects and API and
transactions are internal to nf_tables_api, but for now simply move
them to a seperate section.

Similar, the chain types are in between a set of registration functions,
they belong to the chain section.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 86f1ec32 03-Mar-2015 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: fix userdata length overflow

The NFT_USERDATA_MAXLEN is defined to 256, however we only have a u8
to store its size. Introduce a struct nft_userdata which contains a
length field and indicate its presence using a single bit in the rule.

The length field of struct nft_userdata is also a u8, however we don't
store zero sized data, so the actual length is udata->len + 1.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 75e8d06d 14-Jan-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: validate hooks in NAT expressions

The user can crash the kernel if it uses any of the existing NAT
expressions from the wrong hook, so add some code to validate this
when loading the rule.

This patch introduces nft_chain_validate_hooks() which is based on
an existing function in the bridge version of the reject expression.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b326dd37 10-Nov-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: restore synchronous object release from commit/abort

The existing xtables matches and targets, when used from nft_compat, may
sleep from the destroy path, ie. when removing rules. Since the objects
are released via call_rcu from softirq context, this results in lockdep
splats and possible lockups that may be hard to reproduce.

Patrick also indicated that delayed object release via call_rcu can
cause us problems in the ordering of event notifications when anonymous
sets are in place.

So, this patch restores the synchronous object release from the commit
and abort paths. This includes a call to synchronize_rcu() to make sure
that no packets are walking on the objects that are going to be
released. This is slowier though, but it's simple and it resolves the
aforementioned problems.

This is a partial revert of c7c32e7 ("netfilter: nf_tables: defer all
object release via rcu") that was introduced in 3.16 to speed up
interaction with userspace.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7210e4e3 13-Oct-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: restrict nat/masq expressions to nat chain type

This adds the missing validation code to avoid the use of nat/masq from
non-nat chains. The validation assumes two possible configuration
scenarios:

1) Use of nat from base chain that is not of nat type. Reject this
configuration from the nft_*_init() path of the expression.

2) Use of nat from non-base chain. In this case, we have to wait until
the non-base chain is referenced by at least one base chain via
jump/goto. This is resolved from the nft_*_validate() path which is
called from nf_tables_check_loops().

The user gets an -EOPNOTSUPP in both cases.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9363dc4b 23-Sep-2014 Arturo Borrero <arturo.borrero.glez@gmail.com>

netfilter: nf_tables: store and dump set policy

We want to know in which cases the user explicitly sets the policy
options. In that case, we also want to dump back the info.

Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ce355e20 09-Jul-2014 Eric Dumazet <edumazet@google.com>

netfilter: nf_tables: 64bit stats need some extra synchronization

Use generic u64_stats_sync infrastructure to get proper 64bit stats,
even on 32bit arches, at no extra cost for 64bit arches.

Without this fix, 32bit arches can have some wrong counters at the time
the carry is propagated into upper word.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a0a7379e 10-Jun-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use u32 for chain use counter

Since 4fefee5 ("netfilter: nf_tables: allow to delete several objects
from a batch"), every new rule bumps the chain use counter. However,
this is limited to 16 bits, which means that it will overrun after
2^16 rules.

Use a u32 chain counter and check for overflows (just like we do for
table objects).

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c7c32e72 09-Apr-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: defer all object release via rcu

Now that all objects are released in the reverse order via the
transaction infrastructure, we can enqueue the release via
call_rcu to save one synchronize_rcu. For small rule-sets loaded
via nft -f, it now takes around 50ms less here.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 128ad332 09-May-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove skb and nlh from context structure

Instead of caching the original skbuff that contains the netlink
messages, this stores the netlink message sequence number, the
netlink portID and the report flag. This helps to prepare the
introduction of the object release via call_rcu.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 60319eb1 03-Apr-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use new transaction infrastructure to handle elements

Leave the set content in consistent state if we fail to load the
batch. Use the new generic transaction infrastructure to achieve
this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 55dd6f93 03-Apr-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use new transaction infrastructure to handle table

This patch speeds up rule-set updates and it also provides a way
to revert updates and leave things in consistent state in case that
the batch needs to be aborted.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 91c7b38d 09-Apr-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use new transaction infrastructure to handle chain

This patch speeds up rule-set updates and it also introduces a way to
revert chain updates if the batch is aborted. The idea is to store the
changes in the transaction to apply that in the commit step.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 958bee14 03-Apr-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use new transaction infrastructure to handle sets

This patch reworks the nf_tables API so set updates are included in
the same batch that contains rule updates. This speeds up rule-set
updates since we skip a dialog of four messages between kernel and
user-space (two on each direction), from:

1) create the set and send netlink message to the kernel
2) process the response from the kernel that contains the allocated name.
3) add the set elements and send netlink message to the kernel.
4) process the response from the kernel (to check for errors).

To:

1) add the set to the batch.
2) add the set elements to the batch.
3) add the rule that points to the set.
4) send batch to the kernel.

This also introduces an internal set ID (NFTA_SET_ID) that is unique
in the batch so set elements and rules can refer to new sets.

Backward compatibility has been only retained in userspace, this
means that new nft versions can talk to the kernel both in the new
and the old fashion.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b380e5c7 03-Apr-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add message type to transactions

The patch adds message type to the transaction to simplify the
commit the and abort routines. Yet another step forward in the
generalisation of the transaction infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1081d11b 03-Apr-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: generalise transaction infrastructure

This patch generalises the existing rule transaction infrastructure
so it can be used to handle set, table and chain object transactions
as well. The transaction provides a data area that stores private
information depending on the transaction type.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7c95f6d8 03-Apr-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: deconstify table and chain in context structure

The new transaction infrastructure updates the family, table and chain
objects in the context structure, so let's deconstify them. While at it,
move the context structure initialization routine to the top of the
source file as it will be also used from the table and chain routines.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c50b960c 28-Mar-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: implement proper set selection

The current set selection simply choses the first set type that provides
the requested features, which always results in the rbtree being chosen
by virtue of being the first set in the list.

What we actually want to do is choose the implementation that can provide
the requested features and is optimal from either a performance or memory
perspective depending on the characteristics of the elements and the
preferences specified by the user.

The elements are not known when creating a set. Even if we would provide
them for anonymous (literal) sets, we'd still have standalone sets where
the elements are not known in advance. We therefore need an abstract
description of the data charcteristics.

The kernel already knows the size of the key, this patch starts by
introducing a nested set description which so far contains only the maximum
amount of elements. Based on this the set implementations are changed to
provide an estimate of the required amount of memory and the lookup
complexity class.

The set ops have a new callback ->estimate() that is invoked during set
selection. It receives a structure containing the attributes known to the
kernel and is supposed to populate a struct nft_set_estimate with the
complexity class and, in case the size is known, the complete amount of
memory required, or the amount of memory required per element otherwise.

Based on the policy specified by the user (performance/memory, defaulting
to performance) the kernel will then select the best suited implementation.

Even if the set implementation would allow to add more than the specified
maximum amount of elements, they are enforced since new implementations
might not be able to add more than maximum based on which they were
selected.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 62472bce 07-Mar-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: restore context for expression destructors

In order to fix set destruction notifications and get rid of unnecessary
members in private data structures, pass the context to expressions'
destructor functions again.

In order to do so, replace various members in the nft_rule_trans structure
by the full context.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0768b3b3 19-Feb-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add optional user data area to rules

This allows us to store user comment strings, but it could be also
used to store any kind of information that the user application needs
to link to the rule.

Scratch 8 bits for the new ulen field that indicates the length the
user data area. 4 bits from the handle (so it's 42 bits long, according
to Patrick, it would last 139 years with 1000 new rules per second)
and 4 bits from dlen (so the expression data area is 4K, which seems
sufficient by now even considering the compatibility layer).

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>


# 67a8fc27 18-Feb-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add nft_dereference() macro

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0165d932 25-Jan-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: fix racy rule deletion

We may lost race if we flush the rule-set (which happens asynchronously
via call_rcu) and we try to remove the table (that userspace assumes
to be empty).

Fix this by recovering synchronous rule and chain deletion. This was
introduced time ago before we had no batch support, and synchronous
rule deletion performance was not good. Now that we have the batch
support, we can just postpone the purge of old rule in a second step
in the commit phase. All object deletions are synchronous after this
patch.

As a side effect, we save memory as we don't need rcu_head per rule
anymore.

Cc: Patrick McHardy <kaber@trash.net>
Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 64d46806 05-Feb-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add AF specific expression support

For the reject module, we need to add AF-specific implementations to
get rid of incorrect module dependencies. Try to load an AF-specific
module first and fall back to generic modules.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3876d22d 09-Jan-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: rename nft_do_chain_pktinfo() to nft_do_chain()

We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# fa2c1de0 09-Jan-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: minor nf_chain_type cleanups

Minor nf_chain_type cleanups:

- reorder struct to plug a hoe
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2a37d755 09-Jan-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: constify chain type definitions and pointers

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# baae3e62 09-Jan-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: fix chain type module reference handling

The chain type module reference handling makes no sense at all: we take
a reference immediately when the module is registered, preventing the
module from ever being unloaded.

Fix by taking a reference when we're actually creating a chain of the
chain type and release the reference when destroying the chain.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4566bf27 02-Jan-2014 Patrick McHardy <kaber@trash.net>

netfilter: nft_meta: add l4proto support

For L3-proto independant rules we need to get at the L4 protocol value
directly. Add it to the nft_pktinfo struct and use the meta expression
to retrieve it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 115a60b1 02-Jan-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add support for multi family tables

Add support to register chains to multiple hooks for different address
families for mixed IPv4/IPv6 tables.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# c9484874 02-Jan-2014 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add hook ops to struct nft_pktinfo

Multi-family tables need the AF from the hook ops. Add a pointer to the
hook ops and replace usage of the hooknum member in struct nft_pktinfo.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5eccdfaa 19-Oct-2013 Joe Perches <joe@perches.com>

nf_tables*.h: Remove extern from function prototypes

There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b5bc89bf 10-Oct-2013 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add trace support

This patch adds support for tracing the packet travel through
the ruleset, in a similar fashion to x_tables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0628b123 14-Oct-2013 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink: add batch support and use it from nf_tables

This patch adds a batch support to nfnetlink. Basically, it adds
two new control messages:

* NFNL_MSG_BATCH_BEGIN, that indicates the beginning of a batch,
the nfgenmsg->res_id indicates the nfnetlink subsystem ID.

* NFNL_MSG_BATCH_END, that results in the invocation of the
ss->commit callback function. If not specified or an error
ocurred in the batch, the ss->abort function is invoked
instead.

The end message represents the commit operation in nftables, the
lack of end message results in an abort. This patch also adds the
.call_batch function that is only called from the batch receival
path.

This patch adds atomic rule updates and dumps based on
bitmask generations. This allows to atomically commit a set of
rule-set updates incrementally without altering the internal
state of existing nf_tables expressions/matches/targets.

The idea consists of using a generation cursor of 1 bit and
a bitmask of 2 bits per rule. Assuming the gencursor is 0,
then the genmask (expressed as a bitmask) can be interpreted
as:

00 active in the present, will be active in the next generation.
01 inactive in the present, will be active in the next generation.
10 active in the present, will be deleted in the next generation.
^
gencursor

Once you invoke the transition to the next generation, the global
gencursor is updated:

00 active in the present, will be active in the next generation.
01 active in the present, needs to zero its future, it becomes 00.
10 inactive in the present, delete now.
^
gencursor

If a dump is in progress and nf_tables enters a new generation,
the dump will stop and return -EBUSY to let userspace know that
it has to retry again. In order to invalidate dumps, a global
genctr counter is increased everytime nf_tables enters a new
generation.

This new operation can be used from the user-space utility
that controls the firewall, eg.

nft -f restore

The rule updates contained in `file' will be applied atomically.

cat file
-----
add filter INPUT ip saddr 1.1.1.1 counter accept #1
del filter INPUT ip daddr 2.2.2.2 counter drop #2
-EOF-

Note that the rule 1 will be inactive until the transition to the
next generation, the rule 2 will be evicted in the next generation.

There is a penalty during the rule update due to the branch
misprediction in the packet matching framework. But that should be
quickly resolved once the iteration over the commit list that
contain rules that require updates is finished.

Event notification happens once the rule-set update has been
committed. So we skip notifications is case the rule-set update
is aborted, which can happen in case that the rule-set is tested
to apply correctly.

This patch squashed the following patches from Pablo:

* nf_tables: atomic rule updates and dumps
* nf_tables: get rid of per rule list_head for commits
* nf_tables: use per netns commit list
* nfnetlink: add batch support and use it from nf_tables
* nf_tables: all rule updates are transactional
* nf_tables: attach replacement rule after stale one
* nf_tables: do not allow deletion/replacement of stale rules
* nf_tables: remove unused NFTA_RULE_FLAGS

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 99633ab2 10-Oct-2013 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: complete net namespace support

Register family per netnamespace to ensure that sets are
only visible in its approapriate namespace.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0ca743a5 13-Oct-2013 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add compatibility layer for x_tables

This patch adds the x_tables compatibility layer. This allows you
to use existing x_tables matches and targets from nf_tables.

This compatibility later allows us to use existing matches/targets
for features that are still missing in nf_tables. We can progressively
replace them with native nf_tables extensions. It also provides the
userspace compatibility software that allows you to express the
rule-set using the iptables syntax but using the nf_tables kernel
components.

In order to get this compatibility layer working, I've done the
following things:

* add NFNL_SUBSYS_NFT_COMPAT: this new nfnetlink subsystem is used
to query the x_tables match/target revision, so we don't need to
use the native x_table getsockopt interface.

* emulate xt structures: this required extending the struct nft_pktinfo
to include the fragment offset, which is already obtained from
ip[6]_tables and that is used by some matches/targets.

* add support for default policy to base chains, required to emulate
x_tables.

* add NFTA_CHAIN_USE attribute to obtain the number of references to
chains, required by x_tables emulation.

* add chain packet/byte counters using per-cpu.

* support 32-64 bits compat.

For historical reasons, this patch includes the following patches
that were posted in the netfilter-devel mailing list.

From Pablo Neira Ayuso:
* nf_tables: add default policy to base chains
* netfilter: nf_tables: add NFTA_CHAIN_USE attribute
* nf_tables: nft_compat: private data of target and matches in contiguous area
* nf_tables: validate hooks for compat match/target
* nf_tables: nft_compat: release cached matches/targets
* nf_tables: x_tables support as a compile time option
* nf_tables: fix alias for xtables over nftables module
* nf_tables: add packet and byte counters per chain
* nf_tables: fix per-chain counter stats if no counters are passed
* nf_tables: don't bump chain stats
* nf_tables: add protocol and flags for xtables over nf_tables
* nf_tables: add ip[6]t_entry emulation
* nf_tables: move specific layer 3 compat code to nf_tables_ipv[4|6]
* nf_tables: support 32bits-64bits x_tables compat
* nf_tables: fix compilation if CONFIG_COMPAT is disabled

From Patrick McHardy:
* nf_tables: move policy to struct nft_base_chain
* nf_tables: send notifications for base chain policy changes

From Alexander Primak:
* nf_tables: remove the duplicate NF_INET_LOCAL_OUT

From Nicolas Dichtel:
* nf_tables: fix compilation when nf-netlink is a module

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9370761c 10-Oct-2013 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: convert built-in tables/chains to chain types

This patch converts built-in tables/chains to chain types that
allows you to deploy customized table and chain configurations from
userspace.

After this patch, you have to specify the chain type when
creating a new chain:

add chain ip filter output { type filter hook input priority 0; }
^^^^ ------

The existing chain types after this patch are: filter, route and
nat. Note that tables are just containers of chains with no specific
semantics, which is a significant change with regards to iptables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ef1f7df9 10-Oct-2013 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: expression ops overloading

Split the expression ops into two parts and support overloading of
the runtime expression ops based on the requested function through
a ->select_ops() callback.

This can be used to provide optimized implementations, for instance
for loading small aligned amounts of data from the packet or inlining
frequently used operations into the main evaluation loop.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 20a69341 10-Oct-2013 Patrick McHardy <kaber@trash.net>

netfilter: nf_tables: add netlink set API

This patch adds the new netlink API for maintaining nf_tables sets
independently of the ruleset. The API supports the following operations:

- creation of sets
- deletion of sets
- querying of specific sets
- dumping of all sets

- addition of set elements
- removal of set elements
- dumping of all set elements

Sets are identified by name, each table defines an individual namespace.
The name of a set may be allocated automatically, this is mostly useful
in combination with the NFT_SET_ANONYMOUS flag, which destroys a set
automatically once the last reference has been released.

Sets can be marked constant, meaning they're not allowed to change while
linked to a rule. This allows to perform lockless operation for set
types that would otherwise require locking.

Additionally, if the implementation supports it, sets can (as before) be
used as maps, associating a data value with each key (or range), by
specifying the NFT_SET_MAP flag and can be used for interval queries by
specifying the NFT_SET_INTERVAL flag.

Set elements are added and removed incrementally. All element operations
support batching, reducing netlink message and set lookup overhead.

The old "set" and "hash" expressions are replaced by a generic "lookup"
expression, which binds to the specified set. Userspace is not aware
of the actual set implementation used by the kernel anymore, all
configuration options are generic.

Currently the implementation selection logic is largely missing and the
kernel will simply use the first registered implementation supporting the
requested operation. Eventually, the plan is to have userspace supply a
description of the data characteristics and select the implementation
based on expected performance and memory use.

This patch includes the new 'lookup' expression to look up for element
matching in the set.

This patch includes kernel-doc descriptions for this set API and it
also includes the following fixes.

From Patrick McHardy:
* netfilter: nf_tables: fix set element data type in dumps
* netfilter: nf_tables: fix indentation of struct nft_set_elem comments
* netfilter: nf_tables: fix oops in nft_validate_data_load()
* netfilter: nf_tables: fix oops while listing sets of built-in tables
* netfilter: nf_tables: destroy anonymous sets immediately if binding fails
* netfilter: nf_tables: propagate context to set iter callback
* netfilter: nf_tables: add loop detection

From Pablo Neira Ayuso:
* netfilter: nf_tables: allow to dump all existing sets
* netfilter: nf_tables: fix wrong type for flags variable in newelem

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 96518518 14-Oct-2013 Patrick McHardy <kaber@trash.net>

netfilter: add nftables

This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.

In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:

* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.

Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.

nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).

This patch includes the following components:

* the netlink API: net/netfilter/nf_tables_api.c and
include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
net/ipv4/netfilter/nf_tables_ipv4.c
net/ipv6/netfilter/nf_tables_ipv6.c
net/ipv4/netfilter/nf_tables_arp.c
net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
net/ipv4/netfilter/nf_table_route_ipv4.c
net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
include/net/netfilter/nf_tables.h
include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
net/netfilter/nft_expr_template.c
and the preliminary implementation of the meta target
net/netfilter/nft_meta_target.c

It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.

This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:

From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps

From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release

From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation

From Florian Westphal:
* nft_log: group is u16, snaplen u32

From Phil Oester:
* nf_tables: operational limit match

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>