History log of /linux-master/net/netfilter/nf_conntrack_core.c
Revision Date Author Comments
# 1ebb85f9 09-Feb-2024 Eric Dumazet <edumazet@google.com>

netfilter: conntrack: expedite rcu in nf_conntrack_cleanup_net_list

nf_conntrack_cleanup_net_list() is calling synchronize_net()
while RTNL is not held. This effectively calls synchronize_rcu().

synchronize_rcu() is much slower than synchronize_rcu_expedited(),
and cleanup_net() is currently single threaded. In many workloads
we want cleanup_net() to be faster, in order to free memory and various
sysfs and procfs entries as fast as possible.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 62e7151a 27-Feb-2024 Florian Westphal <fw@strlen.de>

netfilter: bridge: confirm multicast packets before passing them up the stack

conntrack nf_confirm logic cannot handle cloned skbs referencing
the same nf_conn entry, which will happen for multicast (broadcast)
frames on bridges.

Example:
macvlan0
|
br0
/ \
ethX ethY

ethX (or Y) receives a L2 multicast or broadcast packet containing
an IP packet, flow is not yet in conntrack table.

1. skb passes through bridge and fake-ip (br_netfilter)Prerouting.
-> skb->_nfct now references a unconfirmed entry
2. skb is broad/mcast packet. bridge now passes clones out on each bridge
interface.
3. skb gets passed up the stack.
4. In macvlan case, macvlan driver retains clone(s) of the mcast skb
and schedules a work queue to send them out on the lower devices.

The clone skb->_nfct is not a copy, it is the same entry as the
original skb. The macvlan rx handler then returns RX_HANDLER_PASS.
5. Normal conntrack hooks (in NF_INET_LOCAL_IN) confirm the orig skb.

The Macvlan broadcast worker and normal confirm path will race.

This race will not happen if step 2 already confirmed a clone. In that
case later steps perform skb_clone() with skb->_nfct already confirmed (in
hash table). This works fine.

But such confirmation won't happen when eb/ip/nftables rules dropped the
packets before they reached the nf_confirm step in postrouting.

Pablo points out that nf_conntrack_bridge doesn't allow use of stateful
nat, so we can safely discard the nf_conn entry and let inet call
conntrack again.

This doesn't work for bridge netfilter: skb could have a nat
transformation. Also bridge nf prevents re-invocation of inet prerouting
via 'sabotage_in' hook.

Work around this problem by explicit confirmation of the entry at LOCAL_IN
time, before upper layer has a chance to clone the unconfirmed entry.

The downside is that this disables NAT and conntrack helpers.

Alternative fix would be to add locking to all code parts that deal with
unconfirmed packets, but even if that could be done in a sane way this
opens up other problems, for example:

-m physdev --physdev-out eth0 -j SNAT --snat-to 1.2.3.4
-m physdev --physdev-out eth1 -j SNAT --snat-to 1.2.3.5

For multicast case, only one of such conflicting mappings will be
created, conntrack only handles 1:1 NAT mappings.

Users should set create a setup that explicitly marks such traffic
NOTRACK (conntrack bypass) to avoid this, but we cannot auto-bypass
them, ruleset might have accept rules for untracked traffic already,
so user-visible behaviour would change.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217777
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6291b3a6 11-Oct-2023 Florian Westphal <fw@strlen.de>

netfilter: conntrack: convert nf_conntrack_update to netfilter verdicts

This function calls helpers that can return nf-verdicts, but then
those get converted to -1/0 as thats what the caller expects.

Theoretically NF_DROP could have an errno number set in the upper 24
bits of the return value. Or any of those helpers could return
NF_STOLEN, which would result in use-after-free.

This is fine as-is, the called functions don't do this yet.

But its better to avoid possible future problems if the upcoming
patchset to add NF_DROP_REASON() support gains further users, so remove
the 0/-1 translation from the picture and pass the verdicts down to
the caller.

Signed-off-by: Florian Westphal <fw@strlen.de>


# 8a23f4ab 06-Oct-2023 Florian Westphal <fw@strlen.de>

netfilter: conntrack: simplify nf_conntrack_alter_reply

nf_conntrack_alter_reply doesn't do helper reassignment anymore.
Remove the comments that make this claim.

Furthermore, remove dead code from the function and place ot
in nf_conntrack.h.

Signed-off-by: Florian Westphal <fw@strlen.de>


# 4914109a 16-Jul-2023 Xin Long <lucien.xin@gmail.com>

netfilter: allow exp not to be removed in nf_ct_find_expectation

Currently nf_conntrack_in() calling nf_ct_find_expectation() will
remove the exp from the hash table. However, in some scenario, we
expect the exp not to be removed when the created ct will not be
confirmed, like in OVS and TC conntrack in the following patches.

This patch allows exp not to be removed by setting IPS_CONFIRMED
in the status of the tmpl.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# eaf9e719 03-Jul-2023 Florian Westphal <fw@strlen.de>

netfilter: conntrack: don't fold port numbers into addresses before hashing

Originally this used jhash2() over tuple and folded the zone id,
the pernet hash value, destination port and l4 protocol number into the
32bit seed value.

When the switch to siphash was done, I used an on-stack temporary
buffer to build a suitable key to be hashed via siphash().

But this showed up as performance regression, so I got rid of
the temporary copy and collected to-be-hashed data in 4 u64 variables.

This makes it easy to build tuples that produce the same hash, which isn't
desirable even though chain lengths are limited.

Switch back to plain siphash, but just like with jhash2(), take advantage
of the fact that most of to-be-hashed data is already in a suitable order.

Use an empty struct as annotation in 'struct nf_conntrack_tuple' to mark
last member that can be used as hash input.

The only remaining data that isn't present in the tuple structure are the
zone identifier and the pernet hash: fold those into the key.

Fixes: d2c806abcf0b ("netfilter: conntrack: use siphash_4u64")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e1f543dc 24-May-2023 Tijs Van Buggenhout <tijs.van.buggenhout@axsguard.com>

netfilter: conntrack: fix NULL pointer dereference in nf_confirm_cthelper

An nf_conntrack_helper from nf_conn_help may become NULL after DNAT.

Observed when TCP port 1720 (Q931_PORT), associated with h323 conntrack
helper, is DNAT'ed to another destination port (e.g. 1730), while
nfqueue is being used for final acceptance (e.g. snort).

This happenned after transition from kernel 4.14 to 5.10.161.

Workarounds:
* keep the same port (1720) in DNAT
* disable nfqueue
* disable/unload h323 NAT helper

$ linux-5.10/scripts/decode_stacktrace.sh vmlinux < /tmp/kernel.log
BUG: kernel NULL pointer dereference, address: 0000000000000084
[..]
RIP: 0010:nf_conntrack_update (net/netfilter/nf_conntrack_core.c:2080 net/netfilter/nf_conntrack_core.c:2134) nf_conntrack
[..]
nfqnl_reinject (net/netfilter/nfnetlink_queue.c:237) nfnetlink_queue
nfqnl_recv_verdict (net/netfilter/nfnetlink_queue.c:1230) nfnetlink_queue
nfnetlink_rcv_msg (net/netfilter/nfnetlink.c:241) nfnetlink
[..]

Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again")
Signed-off-by: Tijs Van Buggenhout <tijs.van.buggenhout@axsguard.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2cdaa3ee 18-Apr-2023 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: restore IPS_CONFIRMED out of nf_conntrack_hash_check_insert()

e6d57e9ff0ae ("netfilter: conntrack: fix rmmod double-free race")
consolidates IPS_CONFIRMED bit set in nf_conntrack_hash_check_insert().
However, this breaks ctnetlink:

# conntrack -I -p tcp --timeout 123 --src 1.2.3.4 --dst 5.6.7.8 --state ESTABLISHED --sport 1 --dport 4 -u SEEN_REPLY
conntrack v1.4.6 (conntrack-tools): Operation failed: Device or resource busy

This is a partial revert of the aforementioned commit to restore
IPS_CONFIRMED.

Fixes: e6d57e9ff0ae ("netfilter: conntrack: fix rmmod double-free race")
Reported-by: Stéphane Graber <stgraber@stgraber.org>
Tested-by: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e5d015a1 07-Mar-2023 Jeremy Sowden <jeremy@azazel.net>

netfilter: conntrack: fix typo

There's a spelling mistake in a comment. Fix it.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Florian Westphal <fw@strlen.de>


# c77737b7 06-Mar-2023 Eric Dumazet <edumazet@google.com>

netfilter: conntrack: adopt safer max chain length

Customers using GKE 1.25 and 1.26 are facing conntrack issues
root caused to commit c9c3b6811f74 ("netfilter: conntrack: make
max chain length random").

Even if we assume Uniform Hashing, a bucket often reachs 8 chained
items while the load factor of the hash table is smaller than 0.5

With a limit of 16, we reach load factors of 3.
With a limit of 32, we reach load factors of 11.
With a limit of 40, we reach load factors of 15.
With a limit of 50, we reach load factors of 24.

This patch changes MIN_CHAINLEN to 50, to minimize risks.

Ideally, we could in the future add a cushion based on expected
load factor (2 * nf_conntrack_max / nf_conntrack_buckets),
because some setups might expect unusual values.

Fixes: c9c3b6811f74 ("netfilter: conntrack: make max chain length random")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e6d57e9f 14-Feb-2023 Florian Westphal <fw@strlen.de>

netfilter: conntrack: fix rmmod double-free race

nf_conntrack_hash_check_insert() callers free the ct entry directly, via
nf_conntrack_free.

This isn't safe anymore because
nf_conntrack_hash_check_insert() might place the entry into the conntrack
table and then delteted the entry again because it found that a conntrack
extension has been removed at the same time.

In this case, the just-added entry is removed again and an error is
returned to the caller.

Problem is that another cpu might have picked up this entry and
incremented its reference count.

This results in a use-after-free/double-free, once by the other cpu and
once by the caller of nf_conntrack_hash_check_insert().

Fix this by making nf_conntrack_hash_check_insert() not fail anymore
after the insertion, just like before the 'Fixes' commit.

This is safe because a racing nf_ct_iterate() has to wait for us
to release the conntrack hash spinlocks.

While at it, make the function return -EAGAIN in the rmmod (genid
changed) case, this makes nfnetlink replay the command (suggested
by Pablo Neira).

Fixes: c56716c69ce1 ("netfilter: extensions: introduce extension genid count")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2954fe60 01-Feb-2023 Florian Westphal <fw@strlen.de>

netfilter: let reset rules clean out conntrack entries

iptables/nftables support responding to tcp packets with tcp resets.

The generated tcp reset packet passes through both output and postrouting
netfilter hooks, but conntrack will never see them because the generated
skb has its ->nfct pointer copied over from the packet that triggered the
reset rule.

If the reset rule is used for established connections, this
may result in the conntrack entry to be around for a very long
time (default timeout is 5 days).

One way to avoid this would be to not copy the nf_conn pointer
so that the rest packet passes through conntrack too.

Problem is that output rules might not have the same conntrack
zone setup as the prerouting ones, so its possible that the
reset skb won't find the correct entry. Generating a template
entry for the skb seems error prone as well.

Add an explicit "closing" function that switches a confirmed
conntrack entry to closed state and wire this up for tcp.

If the entry isn't confirmed, no action is needed because
the conntrack entry will never be committed to the table.

Reported-by: Russel King <linux@armlinux.org.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# df25455e 01-Feb-2023 Vlad Buslov <vladbu@nvidia.com>

netfilter: nf_conntrack: allow early drop of offloaded UDP conns

Both synchronous early drop algorithm and asynchronous gc worker completely
ignore connections with IPS_OFFLOAD_BIT status bit set. With new
functionality that enabled UDP NEW connection offload in action CT
malicious user can flood the conntrack table with offloaded UDP connections
by just sending a single packet per 5tuple because such connections can no
longer be deleted by early drop algorithm.

To mitigate the issue allow both early drop and gc to consider offloaded
UDP connections for deletion.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2a2fa2ef 15-Dec-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: move rcu read lock to nf_conntrack_find_get

Move rcu_read_lock/unlock to nf_conntrack_find_get(), this avoids
nested rcu_read_lock call from resolve_normal_ct().

Signed-off-by: Florian Westphal <fw@strlen.de>


# 4883ec51 01-Jan-2023 Florian Westphal <fw@strlen.de>

netfilter: conntrack: avoid reload of ct->status

Compiler can't merge the two test_bit() calls, so load ct->status
once and use non-atomic accesses.

This is fine because IPS_EXPECTED or NAT_CLASH are either set at ct
creation time or not at all, but compiler can't know that.

Signed-off-by: Florian Westphal <fw@strlen.de>


# 50bfbb89 01-Jan-2023 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove pr_debug calls

Those are all useless or dubious.
getorigdst() is called via setsockopt, so return value/errno will
already indicate an appropriate error.

For other pr_debug calls there are better replacements, such as
slab/slub debugging or 'conntrack -E' (ctnetlink events).

Signed-off-by: Florian Westphal <fw@strlen.de>


# 8032bf12 09-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use get_random_u32_below() instead of deprecated function

This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>


# d2c806ab 02-Nov-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: use siphash_4u64

This function is used for every packet, siphash_4u64 is noticeably faster
than using local buffer + siphash:

Before:
1.23% kpktgend_0 [kernel.vmlinux] [k] __siphash_unaligned
0.14% kpktgend_0 [nf_conntrack] [k] hash_conntrack_raw
After:
0.79% kpktgend_0 [kernel.vmlinux] [k] siphash_4u64
0.15% kpktgend_0 [nf_conntrack] [k] hash_conntrack_raw

In the pktgen test this gives about ~2.4% performance improvement.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9464d0b6 23-Nov-2022 Xin Long <lucien.xin@gmail.com>

netfilter: conntrack: fix using __this_cpu_add in preemptible

Currently in nf_conntrack_hash_check_insert(), when it fails in
nf_ct_ext_valid_pre/post(), NF_CT_STAT_INC() will be called in the
preemptible context, a call trace can be triggered:

BUG: using __this_cpu_add() in preemptible [00000000] code: conntrack/1636
caller is nf_conntrack_hash_check_insert+0x45/0x430 [nf_conntrack]
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x46
check_preemption_disabled+0xc3/0xf0
nf_conntrack_hash_check_insert+0x45/0x430 [nf_conntrack]
ctnetlink_create_conntrack+0x3cd/0x4e0 [nf_conntrack_netlink]
ctnetlink_new_conntrack+0x1c0/0x450 [nf_conntrack_netlink]
nfnetlink_rcv_msg+0x277/0x2f0 [nfnetlink]
netlink_rcv_skb+0x50/0x100
nfnetlink_rcv+0x65/0x144 [nfnetlink]
netlink_unicast+0x1ae/0x290
netlink_sendmsg+0x257/0x4f0
sock_sendmsg+0x5f/0x70

This patch is to fix it by changing to use NF_CT_STAT_INC_ATOMIC() for
nf_ct_ext_valid_pre/post() check in nf_conntrack_hash_check_insert(),
as well as nf_ct_ext_valid_post() in __nf_conntrack_confirm().

Note that nf_ct_ext_valid_pre() check in __nf_conntrack_confirm() is
safe to use NF_CT_STAT_INC(), as it's under local_bh_disable().

Fixes: c56716c69ce1 ("netfilter: extensions: introduce extension genid count")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 52d1aa8b 09-Nov-2022 Daniel Xu <dxu@dxuuu.xyz>

netfilter: conntrack: Fix data-races around ct mark

nf_conn:mark can be read from and written to in parallel. Use
READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted
compiler optimizations.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2aa19275 16-Sep-2022 Antoine Tenart <atenart@kernel.org>

netfilter: conntrack: revisit the gc initial rescheduling bias

The previous commit changed the way the rescheduling delay is computed
which has a side effect: the bias is now represented as much as the
other entries in the rescheduling delay which makes the logic to kick in
only with very large sets, as the initial interval is very large
(INT_MAX).

Revisit the GC initial bias to allow more frequent GC for smaller sets
while still avoiding wakeups when a machine is mostly idle. We're moving
from a large initial value to pretending we have 100 entries expiring at
the upper bound. This way only a few entries having a small timeout
won't impact much the rescheduling delay and non-idle machines will have
enough entries to lower the delay when needed. This also improves
readability as the initial bias is now linked to what is computed
instead of being an arbitrary large value.

Fixes: 2cfadb761d3d ("netfilter: conntrack: revisit gc autotuning")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>


# 95eabdd2 16-Sep-2022 Antoine Tenart <atenart@kernel.org>

netfilter: conntrack: fix the gc rescheduling delay

Commit 2cfadb761d3d ("netfilter: conntrack: revisit gc autotuning")
changed the eviction rescheduling to the use average expiry of scanned
entries (within 1-60s) by doing:

for (...) {
expires = clamp(nf_ct_expires(tmp), ...);
next_run += expires;
next_run /= 2;
}

The issue is the above will make the average ('next_run' here) more
dependent on the last expiration values than the firsts (for sets > 2).
Depending on the expiration values used to compute the average, the
result can be quite different than what's expected. To fix this we can
do the following:

for (...) {
expires = clamp(nf_ct_expires(tmp), ...);
next_run += (expires - next_run) / ++count;
}

Fixes: 2cfadb761d3d ("netfilter: conntrack: revisit gc autotuning")
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>


# 864b656f 07-Sep-2022 Daniel Xu <dxu@dxuuu.xyz>

bpf: Add support for writing to nf_conn:mark

Support direct writes to nf_conn:mark from TC and XDP prog types. This
is useful when applications want to store per-connection metadata. This
is also particularly useful for applications that run both bpf and
iptables/nftables because the latter can trivially access this metadata.

One example use case would be if a bpf prog is responsible for advanced
packet classification and iptables/nftables is later used for routing
due to pre-existing/legacy code.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/ebca06dea366e3e7e861c12f375a548cc4c61108.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 6e116280 25-Jul-2022 Kumar Kartikeya Dwivedi <memxor@gmail.com>

net: netfilter: Remove ifdefs for code shared by BPF and ctnetlink

The current ifdefry for code shared by the BPF and ctnetlink side looks
ugly. As per Pablo's request, simplify this by unconditionally compiling
in the code. This can be revisited when the shared code between the two
grows further.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220725085130.11553-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# b1185090 26-Aug-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: remove nf_conntrack_helper sysctl and modparam toggles

__nf_ct_try_assign_helper() remains in place but it now requires a
template to configure the helper.

A toggle to disable automatic helper assignment was added by:

a9006892643a ("netfilter: nf_ct_helper: allow to disable automatic helper assignment")

in 2012 to address the issues described in "Secure use of iptables and
connection tracking helpers". Automatic conntrack helper assignment was
disabled by:

3bb398d925ec ("netfilter: nf_ct_helper: disable automatic helper assignment")

back in 2016.

This patch removes the sysctl and modparam toggles, users now have to
rely on explicit conntrack helper configuration via ruleset.

Update tools/testing/selftests/netfilter/nft_conntrack_helper.sh to
check that auto-assignment does not happen anymore.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ef69aa3a 21-Jul-2022 Lorenzo Bianconi <lorenzo@kernel.org>

net: netfilter: Add kfuncs to set and change CT status

Introduce bpf_ct_set_status and bpf_ct_change_status kfunc helpers in
order to set nf_conn field of allocated entry or update nf_conn status
field of existing inserted entry. Use nf_ct_change_status_common to
share the permitted status field changes between netlink and BPF side
by refactoring ctnetlink_change_status.

It is required to introduce two kfuncs taking nf_conn___init and nf_conn
instead of sharing one because KF_TRUSTED_ARGS flag causes strict type
checking. This would disallow passing nf_conn___init to kfunc taking
nf_conn, and vice versa. We cannot remove the KF_TRUSTED_ARGS flag as we
only want to accept refcounted pointers and not e.g. ct->master.

Hence, bpf_ct_set_* kfuncs are meant to be used on allocated CT, and
bpf_ct_change_* kfuncs are meant to be used on inserted or looked up
CT entry.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220721134245.2450-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 0b389236 21-Jul-2022 Kumar Kartikeya Dwivedi <memxor@gmail.com>

net: netfilter: Add kfuncs to set and change CT timeout

Introduce bpf_ct_set_timeout and bpf_ct_change_timeout kfunc helpers in
order to change nf_conn timeout. This is same as ctnetlink_change_timeout,
hence code is shared between both by extracting it out to
__nf_ct_change_timeout. It is also updated to return an error when it
sees IPS_FIXED_TIMEOUT_BIT bit in ct->status, as that check was missing.

It is required to introduce two kfuncs taking nf_conn___init and nf_conn
instead of sharing one because KF_TRUSTED_ARGS flag causes strict type
checking. This would disallow passing nf_conn___init to kfunc taking
nf_conn, and vice versa. We cannot remove the KF_TRUSTED_ARGS flag as we
only want to accept refcounted pointers and not e.g. ct->master.

Apart from this, bpf_ct_set_timeout is only called for newly allocated
CT so it doesn't need to inspect the status field just yet. Sharing the
helpers even if it was possible would make timeout setting helper
sensitive to order of setting status and timeout after allocation.

Hence, bpf_ct_set_* kfuncs are meant to be used on allocated CT, and
bpf_ct_change_* kfuncs are meant to be used on inserted or looked up
CT entry.

Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220721134245.2450-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# 6be79156 24-May-2022 Jackie Liu <liuyun01@kylinos.cn>

netfilter: conntrack: use fallthrough to cleanup

These cases all use the same function. we can simplify the code through
fallthrough.

$ size net/netfilter/nf_conntrack_core.o

text data bss dec hex filename
before 81601 81430 768 163799 27fd7 net/netfilter/nf_conntrack_core.o
after 80361 81430 768 162559 27aff net/netfilter/nf_conntrack_core.o

Arch: aarch64
Gcc : gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0ed8f619 06-Jul-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: fix crash due to confirmed bit load reordering

Kajetan Puchalski reports crash on ARM, with backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1 CPU2
encounter 'ct' during hlist walk
delete_from_lists
refcount drops to 0
kmem_cache_free(ct);
__nf_conntrack_alloc() // returns same object
refcount_inc_not_zero(ct); /* might fail */

/* If set, ct is public/in the hash table */
test_bit(IPS_CONFIRMED_BIT, &ct->status);

In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
the next entry in the list. This happens for objects that are about
to be free'd, that have been free'd, or that have been reallocated
by __nf_conntrack_alloc(), but where the refcount has not been
increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
in ct->status. If set, the object is public/in the table.

If not, the object must be skipped; CPU2 calls nf_ct_put() to
un-do the refcount increment and moves to the next object.

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
cpu will do the unlink, the other one will only drop its reference count.

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing
of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because
IPS_DYING_BIT is 0 (already cleared).

2) is already wrong: CPU2 will handle a partially initited object
that is supposed to be private to CPU1.

Add needed barriers when refcount_inc_not_zero() is successful.

It also inserts a smp_wmb() before the refcount is set to 1 during
allocation.

Because other CPU might still see the object, refcount_set(1)
"resurrects" it, so we need to make sure that other CPUs will also observe
the right content. In particular, the CONFIRMED bit test must only pass
once the object is fully initialised and either in the hash or about to be
inserted (with locks held to delay possible unlink from early_drop or
gc worker).

I did not change flow_offload_alloc(), as far as I can see it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb so its refcount should be >= 1 in all cases.

v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
add comment in nf_conntrack_netlink, no control dependency there
due to locks.

Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/all/Yr7WTfd6AVTQkLjI@e126311.manchester.arm.com/
Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Diagnosed-by: Will Deacon <will@kernel.org>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Will Deacon <will@kernel.org>


# 90d1daa4 25-Apr-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: add nf_conntrack_events autodetect mode

This adds the new nf_conntrack_events=2 mode and makes it the
default.

This leverages the earlier flag in struct net to allow to avoid
the event extension as long as no event listener is active in
the namespace.

This avoids, for most cases, allocation of ct->ext area.
A followup patch will take further advantage of this by avoiding
calls down into the event framework if the extension isn't present.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b0a7ab4a 25-Apr-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: un-inline nf_ct_ecache_ext_add

Only called when new ct is allocated or the extension isn't present.
This function will be extended, place this in the conntrack module
instead of inlining.

The callers already depend on nf_conntrack module.
Return value is changed to bool, noone used the returned pointer.

Make sure that the core drops the newly allocated conntrack
if the extension is requested but can't be added.
This makes it necessary to ifdef the section, as the stub
always returns false we'd drop every new conntrack if the
the ecache extension is disabled in kconfig.

Add from data path (xt_CT, nft_ct) is unchanged.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8169ff58 08-Apr-2022 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: add nf_ct_iter_data object for nf_ct_iterate_cleanup*()

This patch adds a structure to collect all the context data that is
passed to the cleanup iterator.

struct nf_ct_iter_data {
struct net *net;
void *data;
u32 portid;
int report;
};

There is a netns field that allows to clean up conntrack entries
specifically owned by the specified netns.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0bcfbafb 11-Apr-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: avoid unconditional local_bh_disable

Now that the conntrack entry isn't placed on the pcpu list anymore the
bh only needs to be disabled in the 'expectation present' case.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8a75a2c1 11-Apr-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove unconfirmed list

It has no function anymore and can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ace53fdc 11-Apr-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove __nf_ct_unconfirmed_destroy

Its not needed anymore:

A. If entry is totally new, then the rcu-protected resource
must already have been removed from global visibility before call
to nf_ct_iterate_destroy.

B. If entry was allocated before, but is not yet in the hash table
(uncofirmed case), genid gets incremented and synchronize_rcu() call
makes sure access has completed.

C. Next attempt to peek at extension area will fail for unconfirmed
conntracks, because ext->genid != genid.

D. Conntracks in the hash are iterated as before.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c56716c6 11-Apr-2022 Florian Westphal <fw@strlen.de>

netfilter: extensions: introduce extension genid count

Multiple netfilter extensions store pointers to external data
in their extension area struct.

Examples:
1. Timeout policies
2. Connection tracking helpers.

No references are taken for these.

When a helper or timeout policy is removed, the conntrack table gets
traversed and affected extensions are cleared.

Conntrack entries not yet in the hashtable are referenced via a special
list, the unconfirmed list.

On removal of a policy or connection tracking helper, the unconfirmed
list gets traversed an all entries are marked as dying, this prevents
them from getting committed to the table at insertion time: core checks
for dying bit, if set, the conntrack entry gets destroyed at confirm
time.

The disadvantage is that each new conntrack has to be added to the percpu
unconfirmed list, and each insertion needs to remove it from this list.
The list is only ever needed when a policy or helper is removed -- a rare
occurrence.

Add a generation ID count: Instead of adding to the list and then
traversing that list on policy/helper removal, increment a counter
that is stored in the extension area.

For unconfirmed conntracks, the extension has the genid valid at ct
allocation time.

Removal of a helper/policy etc. increments the counter.
At confirmation time, validate that ext->genid == global_id.

If the stored number is not the same, do not allow the conntrack
insertion, just like as if a confirmed-list traversal would have flagged
the entry as dying.

After insertion, the genid is no longer relevant (conntrack entries
are now reachable via the conntrack table iterators and is set to 0.

This allows removal of the percpu unconfirmed list.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 17438b42 11-Apr-2022 Florian Westphal <fw@strlen.de>

netfilter: remove nf_ct_unconfirmed_destroy helper

This helper tags connections not yet in the conntrack table as
dying. These nf_conn entries will be dropped instead when the
core attempts to insert them from the input or postrouting
'confirm' hook.

After the previous change, the entries get unlinked from the
list earlier, so that by the time the actual exit hook runs,
new connections no longer have a timeout policy assigned.

Its enough to walk the hashtable instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1397af5b 11-Apr-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove the percpu dying list

Its no longer needed. Entries that need event redelivery are placed
on the new pernet dying list.

The advantage is that there is no need to take additional spinlock on
conntrack removal unless event redelivery failed or the conntrack entry
was never added to the table in the first place (confirmed bit not set).

The IPS_CONFIRMED bit now needs to be set as soon as the entry has been
unlinked from the unconfirmed list, else the destroy function may
attempt to unlink it a second time.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2ed3bf18 11-Apr-2022 Florian Westphal <fw@strlen.de>

netfilter: ecache: use dedicated list for event redelivery

This disentangles event redelivery and the percpu dying list.

Because entries are now stored on a dedicated list, all
entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries
still have confirmed bit set -- the reference count is at least 1.

The 'struct net' back-pointer can be removed as well.

The pcpu dying list will be removed eventually, it has no functionality.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2cfadb76 16-Feb-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: revisit gc autotuning

as of commit 4608fdfc07e1
("netfilter: conntrack: collect all entries in one cycle")
conntrack gc was changed to run every 2 minutes.

On systems where conntrack hash table is set to large value, most evictions
happen from gc worker rather than the packet path due to hash table
distribution.

This causes netlink event overflows when events are collected.

This change collects average expiry of scanned entries and
reschedules to the average remaining value, within 1 to 60 second interval.

To avoid event overflows, reschedule after each bucket and add a
limit for both run time and number of evictions per run.

If more entries have to be evicted, reschedule and restart 1 jiffy
into the future.

Reported-by: Karel Rericha <karel@maxtel.cz>
Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1015c3de 20-Jan-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove extension register api

These no longer register/unregister a meaningful structure so remove it.

Cc: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1bc91a5d 20-Jan-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: handle ->destroy hook via nat_ops instead

The nat module already exposes a few functions to the conntrack core.
Move the nat extension destroy hook to it.

After this, no conntrack extension needs a destroy hook.
'struct nf_ct_ext_type' and the register/unregister api can be removed
in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5f31edc0 20-Jan-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: move extension sizes into core

No need to specify this in the registration modules, we already
collect all sizes for build-time checks on the maximum combined size.

After this change, all extensions except nat have no meaningful content
in their nf_ct_ext_type struct definition.

Next patch handles nat, this will then allow to remove the dynamic
register api completely.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b4c2b959 14-Jan-2022 Kumar Kartikeya Dwivedi <memxor@gmail.com>

net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF

This change adds conntrack lookup helpers using the unstable kfunc call
interface for the XDP and TC-BPF hooks. The primary usecase is
implementing a synproxy in XDP, see Maxim's patchset [0].

Export get_net_ns_by_id as nf_conntrack_bpf.c needs to call it.

This object is only built when CONFIG_DEBUG_INFO_BTF_MODULES is enabled.

[0]: https://lore.kernel.org/bpf/20211019144655.3483197-1-maximmi@nvidia.com

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>


# ee0a4dc9 08-Mar-2022 Florian Westphal <fw@strlen.de>

Revert "netfilter: conntrack: tag conntracks picked up in local out hook"

This was a prerequisite for the ill-fated
"netfilter: nat: force port remap to prevent shadowing well-known ports".

As this has been reverted, this change can be backed out too.

Signed-off-by: Florian Westphal <fw@strlen.de>


# 830af2eb 13-Jan-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: don't increment invalid counter on NF_REPEAT

The packet isn't invalid, REPEAT means we're trying again after cleaning
out a stale connection, e.g. via tcp tracker.

This caused increases of invalid stat counter in a test case involving
frequent connection reuse, even though no packet is actually invalid.

Fixes: 56a62e2218f5 ("netfilter: conntrack: fix NF_REPEAT handling")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 408bdcfc 06-Jan-2022 Florian Westphal <fw@strlen.de>

net: prefer nf_ct_put instead of nf_conntrack_put

Its the same as nf_conntrack_put(), but without the
need for an indirect call. The downside is a module dependency on
nf_conntrack, but all of these already depend on conntrack anyway.

Cc: Paul Blakey <paulb@mellanox.com>
Cc: dev@openvswitch.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6ae7989c 06-Jan-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: avoid useless indirection during conntrack destruction

nf_ct_put() results in a usesless indirection:

nf_ct_put -> nf_conntrack_put -> nf_conntrack_destroy -> rcu readlock +
indirect call of ct_hooks->destroy().

There are two _put helpers:
nf_ct_put and nf_conntrack_put. The latter is what should be used in
code that MUST NOT cause a linker dependency on the conntrack module
(e.g. calls from core network stack).

Everyone else should call nf_ct_put() instead.

A followup patch will convert a few nf_conntrack_put() calls to
nf_ct_put(), in particular from modules that already have a conntrack
dependency such as act_ct or even nf_conntrack itself.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 285c8a7a 06-Jan-2022 Florian Westphal <fw@strlen.de>

netfilter: make function op structures const

No functional changes, these structures should be const.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3fce1649 06-Jan-2022 Florian Westphal <fw@strlen.de>

netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook

ip_ct_attach predates struct nf_ct_hook, we can place it there and
remove the exported symbol.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 71977437 06-Jan-2022 Florian Westphal <fw@strlen.de>

netfilter: conntrack: convert to refcount_t api

Convert nf_conn reference counting from atomic_t to refcount_t based api.
refcount_t api provides more runtime sanity checks and will warn on
certain constructs, e.g. refcount_inc() on a zero reference count, which
usually indicates use-after-free.

For this reason template allocation is changed to init the refcount to
1, the subsequenct add operations are removed.

Likewise, init_conntrack() is changed to set the initial refcount to 1
instead refcount_inc().

This is safe because the new entry is not (yet) visible to other cpus.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9795ded7 03-Jan-2022 Paul Blakey <paulb@nvidia.com>

net/sched: act_ct: Fill offloading tuple iifidx

Driver offloading ct tuples can use the information of which devices
received the packets that created the offloaded connections, to
more efficiently offload them only to the relevant device.

Add new act_ct nf conntrack extension, which is used to store the skb
devices before offloading the connection, and then fill in the tuple
iifindex so drivers can get the device via metadata dissector match.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4a6fbdd8 17-Dec-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: tag conntracks picked up in local out hook

This allows to identify flows that originate from local machine
in a followup patch.

It would be possible to make this a ->status bit instead.
For now I did not do that yet because I don't have a use-case for
exposing this info to userspace.

If one comes up the toggle can be replaced with a status bit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 284ca764 08-Dec-2021 luo penghao <luo.penghao@zte.com.cn>

netfilter: conntrack: Remove useless assignment statements

The old_size assignment here will not be used anymore

The clang_analyzer complains as follows:

Value stored to 'old_size' is never read

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4be1dbb7 18-Nov-2021 Kees Cook <keescook@chromium.org>

netfilter: conntrack: Use memset_startat() to zero struct nf_conn

In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.

Use memset_startat() to avoid confusing memset() about writing beyond
the target struct member.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 49ecc2e9 15-Nov-2021 Eric Dumazet <edumazet@google.com>

net: align static siphash keys

siphash keys use 16 bytes.

Define siphash_aligned_key_t macro so that we can make sure they
are not crossing a cache line boundary.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 802a7dc5 07-Dec-2021 Eric Dumazet <edumazet@google.com>

netfilter: conntrack: annotate data-races around ct->timeout

(struct nf_conn)->timeout can be read/written locklessly,
add READ_ONCE()/WRITE_ONCE() to prevent load/store tearing.

BUG: KCSAN: data-race in __nf_conntrack_alloc / __nf_conntrack_find_get

write to 0xffff888132e78c08 of 4 bytes by task 6029 on cpu 0:
__nf_conntrack_alloc+0x158/0x280 net/netfilter/nf_conntrack_core.c:1563
init_conntrack+0x1da/0xb30 net/netfilter/nf_conntrack_core.c:1635
resolve_normal_ct+0x502/0x610 net/netfilter/nf_conntrack_core.c:1746
nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901
ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414
nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline]
nf_hook_slow+0x72/0x170 net/netfilter/core.c:619
nf_hook include/linux/netfilter.h:262 [inline]
NF_HOOK include/linux/netfilter.h:305 [inline]
ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324
inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135
__tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402
tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline]
tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680
__tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864
tcp_push_pending_frames include/net/tcp.h:1897 [inline]
tcp_data_snd_check+0x62/0x2e0 net/ipv4/tcp_input.c:5452
tcp_rcv_established+0x880/0x10e0 net/ipv4/tcp_input.c:5947
tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0xf2/0x270 net/core/sock.c:2768
release_sock+0x40/0x110 net/core/sock.c:3300
sk_stream_wait_memory+0x435/0x700 net/core/stream.c:145
tcp_sendmsg_locked+0xb85/0x25a0 net/ipv4/tcp.c:1402
tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1440
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:644
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg net/socket.c:724 [inline]
__sys_sendto+0x21e/0x2c0 net/socket.c:2036
__do_sys_sendto net/socket.c:2048 [inline]
__se_sys_sendto net/socket.c:2044 [inline]
__x64_sys_sendto+0x74/0x90 net/socket.c:2044
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff888132e78c08 of 4 bytes by task 17446 on cpu 1:
nf_ct_is_expired include/net/netfilter/nf_conntrack.h:286 [inline]
____nf_conntrack_find net/netfilter/nf_conntrack_core.c:776 [inline]
__nf_conntrack_find_get+0x1c7/0xac0 net/netfilter/nf_conntrack_core.c:807
resolve_normal_ct+0x273/0x610 net/netfilter/nf_conntrack_core.c:1734
nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901
ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414
nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline]
nf_hook_slow+0x72/0x170 net/netfilter/core.c:619
nf_hook include/linux/netfilter.h:262 [inline]
NF_HOOK include/linux/netfilter.h:305 [inline]
ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324
inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135
__tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402
__tcp_send_ack+0x1fd/0x300 net/ipv4/tcp_output.c:3956
tcp_send_ack+0x23/0x30 net/ipv4/tcp_output.c:3962
__tcp_ack_snd_check+0x2d8/0x510 net/ipv4/tcp_input.c:5478
tcp_ack_snd_check net/ipv4/tcp_input.c:5523 [inline]
tcp_rcv_established+0x8c2/0x10e0 net/ipv4/tcp_input.c:5948
tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0xf2/0x270 net/core/sock.c:2768
release_sock+0x40/0x110 net/core/sock.c:3300
tcp_sendpage+0x94/0xb0 net/ipv4/tcp.c:1114
inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833
rds_tcp_xmit+0x376/0x5f0 net/rds/tcp_send.c:118
rds_send_xmit+0xbed/0x1500 net/rds/send.c:367
rds_send_worker+0x43/0x200 net/rds/threads.c:200
process_one_work+0x3fc/0x980 kernel/workqueue.c:2298
worker_thread+0x616/0xa70 kernel/workqueue.c:2445
kthread+0x2c7/0x2e0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30

value changed: 0x00027cc2 -> 0x00000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 17446 Comm: kworker/u4:5 Tainted: G W 5.16.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: krdsd rds_send_worker

Note: I chose an arbitrary commit for the Fixes: tag,
because I do not think we need to backport this fix to very old kernels.

Fixes: e37542ba111f ("netfilter: conntrack: avoid possible false sharing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e9edc188 17-Sep-2021 Eric Dumazet <edumazet@google.com>

netfilter: conntrack: serialize hash resizes and cleanups

Syzbot was able to trigger the following warning [1]

No repro found by syzbot yet but I was able to trigger similar issue
by having 2 scripts running in parallel, changing conntrack hash sizes,
and:

for j in `seq 1 1000` ; do unshare -n /bin/true >/dev/null ; done

It would take more than 5 minutes for net_namespace structures
to be cleaned up.

This is because nf_ct_iterate_cleanup() has to restart everytime
a resize happened.

By adding a mutex, we can serialize hash resizes and cleanups
and also make get_next_corpse() faster by skipping over empty
buckets.

Even without resizes in the picture, this patch considerably
speeds up network namespace dismantles.

[1]
INFO: task syz-executor.0:8312 can't die for more than 144 seconds.
task:syz-executor.0 state:R running task stack:25672 pid: 8312 ppid: 6573 flags:0x00004006
Call Trace:
context_switch kernel/sched/core.c:4955 [inline]
__schedule+0x940/0x26f0 kernel/sched/core.c:6236
preempt_schedule_common+0x45/0xc0 kernel/sched/core.c:6408
preempt_schedule_thunk+0x16/0x18 arch/x86/entry/thunk_64.S:35
__local_bh_enable_ip+0x109/0x120 kernel/softirq.c:390
local_bh_enable include/linux/bottom_half.h:32 [inline]
get_next_corpse net/netfilter/nf_conntrack_core.c:2252 [inline]
nf_ct_iterate_cleanup+0x15a/0x450 net/netfilter/nf_conntrack_core.c:2275
nf_conntrack_cleanup_net_list+0x14c/0x4f0 net/netfilter/nf_conntrack_core.c:2469
ops_exit_list+0x10d/0x160 net/core/net_namespace.c:171
setup_net+0x639/0xa30 net/core/net_namespace.c:349
copy_net_ns+0x319/0x760 net/core/net_namespace.c:470
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
ksys_unshare+0x445/0x920 kernel/fork.c:3128
__do_sys_unshare kernel/fork.c:3202 [inline]
__se_sys_unshare kernel/fork.c:3200 [inline]
__x64_sys_unshare+0x2d/0x40 kernel/fork.c:3200
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f63da68e739
RSP: 002b:00007f63d7c05188 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f63da792f80 RCX: 00007f63da68e739
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000
RBP: 00007f63da6e8cc4 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f63da792f80
R13: 00007fff50b75d3f R14: 00007f63d7c05300 R15: 0000000000022000

Showing all locks held in the system:
1 lock held by khungtaskd/27:
#0: ffffffff8b980020 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6446
2 locks held by kworker/u4:2/153:
#0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
#0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic_long_set include/linux/atomic/atomic-long.h:41 [inline]
#0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: atomic_long_set include/linux/atomic/atomic-instrumented.h:1198 [inline]
#0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:634 [inline]
#0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:661 [inline]
#0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x896/0x1690 kernel/workqueue.c:2268
#1: ffffc9000140fdb0 ((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x8ca/0x1690 kernel/workqueue.c:2272
1 lock held by systemd-udevd/2970:
1 lock held by in:imklog/6258:
#0: ffff88807f970ff0 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:990
3 locks held by kworker/1:6/8158:
1 lock held by syz-executor.0/8312:
2 locks held by kworker/u4:13/9320:
1 lock held by syz-executor.5/10178:
1 lock held by syz-executor.4/10217:

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b16ac3c4 08-Sep-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: include zone id in tuple hash again

commit deedb59039f111 ("netfilter: nf_conntrack: add direction support for zones")
removed the zone id from the hash value.

This has implications on hash chain lengths with overlapping tuples, which
can hit 64k entries on released kernels, before upper droplimit was added
in d7e7747ac5c ("netfilter: refuse insertion if chain has grown too large").

With that change reverted, test script coming with this series shows
linear insertion time growth:

10000 entries in 3737 ms (now 10000 total, loop 1)
10000 entries in 16994 ms (now 20000 total, loop 2)
10000 entries in 47787 ms (now 30000 total, loop 3)
10000 entries in 72731 ms (now 40000 total, loop 4)
10000 entries in 95761 ms (now 50000 total, loop 5)
10000 entries in 96809 ms (now 60000 total, loop 6)
inserted 60000 entries from packet path in 333825 ms

With d7e7747ac5c in place, the test fails.

There are three supported zone use cases:
1. Connection is in the default zone (zone 0).
This means to special config (the default).
2. Connection is in a different zone (1 to 2**16).
This means rules are in place to put packets in
the desired zone, e.g. derived from vlan id or interface.
3. Original direction is in zone X and Reply is in zone 0.

3) allows to use of the existing NAT port collision avoidance to provide
connectivity to internet/wan even when the various zones have overlapping
source networks separated via policy routing.

In case the original zone is 0 all three cases are identical.

There is no way to place original direction in zone x and reply in
zone y (with y != 0).

Zones need to be assigned manually via the iptables/nftables ruleset,
before conntrack lookup occurs (raw table in iptables) using the
"CT" target conntrack template support
(-j CT --{zone,zone-orig,zone-reply} X).

Normally zone assignment happens based on incoming interface, but could
also be derived from packet mark, vlan id and so on.

This means that when case 3 is used, the ruleset will typically not even
assign a connection tracking template to the "reply" packets, so lookup
happens in zone 0.

However, it is possible that reply packets also match a ct zone
assignment rule which sets up a template for zone X (X > 0) in original
direction only.

Therefore, after making the zone id part of the hash, we need to do a
second lookup using the reply zone id if we did not find an entry on
the first lookup.

In practice, most deployments will either not use zones at all or the
origin and reply zones are the same, no second lookup is required in
either case.

After this change, packet path insertion test passes with constant
insertion times:

10000 entries in 1064 ms (now 10000 total, loop 1)
10000 entries in 1074 ms (now 20000 total, loop 2)
10000 entries in 1066 ms (now 30000 total, loop 3)
10000 entries in 1079 ms (now 40000 total, loop 4)
10000 entries in 1081 ms (now 50000 total, loop 5)
10000 entries in 1082 ms (now 60000 total, loop 6)
inserted 60000 entries from packet path in 6452 ms

Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c9c3b681 08-Sep-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: make max chain length random

Similar to commit 67d6d681e15b
("ipv4: make exception cache less predictible"):

Use a random drop length to make it harder to detect when entries were
hashed to same bucket list.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d7e7747a 26-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: refuse insertion if chain has grown too large

Also add a stat counter for this that gets exported both via old /proc
interface and ctnetlink.

Assuming the old default size of 16536 buckets and max hash occupancy of
64k, this results in 128k insertions (origin+reply), so ~8 entries per
chain on average.

The revised settings in this series will result in about two entries per
bucket on average.

This allows a hard-limit ceiling of 64.

This is not tunable at the moment, but its possible to either increase
nf_conntrack_buckets or decrease nf_conntrack_max to reduce average
lengths.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# dd6d2910 26-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: switch to siphash

Replace jhash in conntrack and nat core with siphash.

While at it, use the netns mix value as part of the input key
rather than abuse the seed value.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d532bcd0 26-Aug-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: sanitize table size default settings

conntrack has two distinct table size settings:
nf_conntrack_max and nf_conntrack_buckets.

The former limits how many conntrack objects are allowed to exist
in each namespace.

The second sets the size of the hashtable.

As all entries are inserted twice (once for original direction, once for
reply), there should be at least twice as many buckets in the table than
the maximum number of conntrack objects that can exist at the same time.

Change the default multiplier to 1 and increase the chosen bucket sizes.
This results in the same nf_conntrack_max settings as before but reduces
the average bucket list length.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4608fdfc 26-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: collect all entries in one cycle

Michal Kubecek reports that conntrack gc is responsible for frequent
wakeups (every 125ms) on idle systems.

On busy systems, timed out entries are evicted during lookup.
The gc worker is only needed to remove entries after system becomes idle
after a busy period.

To resolve this, always scan the entire table.
If the scan is taking too long, reschedule so other work_structs can run
and resume from next bucket.

After a completed scan, wait for 2 minutes before the next cycle.
Heuristics for faster re-schedule are removed.

GC_SCAN_INTERVAL could be exposed as a sysctl in the future to allow
tuning this as-needed or even turn the gc worker off.

Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 30a56a2b 18-Jul-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: adjust stop timestamp to real expiry value

In case the entry is evicted via garbage collection there is
delay between the timeout value and the eviction event.

This adjusts the stop value based on how much time has passed.

Fixes: b87a2f9199ea82 ("netfilter: conntrack: add gc worker to remove timed-out entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cf4466ea 27-Jun-2021 Manfred Spraul <manfred@colorfullife.com>

netfilter: conntrack: Mark access for KCSAN

KCSAN detected an data race with ipc/sem.c that is intentional.

As nf_conntrack_lock() uses the same algorithm: Update
nf_conntrack_core as well:

nf_conntrack_lock() contains
a1) spin_lock()
a2) smp_load_acquire(nf_conntrack_locks_all).

a1) actually accesses one lock from an array of locks.

nf_conntrack_locks_all() contains
b1) nf_conntrack_locks_all=true (normal write)
b2) spin_lock()
b3) spin_unlock()

b2 and b3 are done for every lock.

This guarantees that nf_conntrack_locks_all() prevents any
concurrent nf_conntrack_lock() owners:
If a thread past a1), then b2) will block until that thread releases
the lock.
If the threat is before a1, then b3)+a1) ensure the write b1) is
visible, thus a2) is guaranteed to see the updated value.

But: This is only the latest time when b1) becomes visible.
It may also happen that b1) is visible an undefined amount of time
before the b3). And thus KCSAN will notice a data race.

In addition, the compiler might be too clever.

Solution: Use WRITE_ONCE().

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a23f89a9 30-Jun-2021 Vasily Averin <vvs@virtuozzo.com>

netfilter: conntrack: nf_ct_gre_keymap_flush() removal

nf_ct_gre_keymap_flush() is useless.
It is called from nf_conntrack_cleanup_net_list() only and tries to remove
nf_ct_gre_keymap entries from pernet gre keymap list. Though:
a) at this point the list should already be empty, all its entries were
deleted during the conntracks cleanup, because
nf_conntrack_cleanup_net_list() executes nf_ct_iterate_cleanup(kill_all)
before nf_conntrack_proto_pernet_fini():
nf_conntrack_cleanup_net_list
+- nf_ct_iterate_cleanup
| nf_ct_put
| nf_conntrack_put
| nf_conntrack_destroy
| destroy_conntrack
| destroy_gre_conntrack
| nf_ct_gre_keymap_destroy
`- nf_conntrack_proto_pernet_fini
nf_ct_gre_keymap_flush

b) Let's say we find that the keymap list is not empty. This means netns
still has a conntrack associated with gre, in which case we should not free
its memory, because this will lead to a double free and related crashes.
However I doubt it could have gone unnoticed for years, obviously
this does not happen in real life. So I think we can remove
both nf_ct_gre_keymap_flush() and nf_conntrack_proto_pernet_fini().

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0418b989 01-Jun-2021 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: add nf_ct_pernet() helper function

Consolidate call to net_generic(net, nf_conntrack_net_id) in this
wrapper function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c53bd0e9 12-Apr-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: move ct counter to net_generic data

Its only needed from slowpath (sysctl, ctnetlink, gc worker) and
when a new conntrack object is allocated.

Furthermore, each write dirties the otherwise read-mostly pernet
data in struct net.ct, which are accessed from packet path.

Move it to the net_generic data. This makes struct netns_ct
read-mostly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f6f2e580 12-Apr-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: move expect counter to net_generic data

Creation of a new conntrack entry isn't a frequent operation (compared
to 'ct entry already exists'). Creation of a new entry that is also an
expected (related) connection even less so.

Place this counter in net_generic data.

A followup patch will also move the conntrack count -- this will make
netns_ct a read-mostly structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1379940b 01-Apr-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: move ecache dwork to net_generic infra

dwork struct is large (>128 byte) and not needed when conntrack module
is not loaded.

Place it in net_generic data instead. The struct net dwork member is now
obsolete and will be removed in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 07998281 04-Feb-2021 Florian Westphal <fw@strlen.de>

netfilter: conntrack: skip identical origin tuple in same zone only

The origin skip check needs to re-test the zone. Else, we might skip
a colliding tuple in the reply direction.

This only occurs when using 'directional zones' where origin tuples
reside in different zones but the reply tuples share the same zone.

This causes the new conntrack entry to be dropped at confirmation time
because NAT clash resolution was elided.

Fixes: 4e35c1cb9460240 ("netfilter: nf_nat: skip nat clash resolution for same-origin entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ff73e747 25-Aug-2020 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove unneeded nf_ct_put

We can delay refcount increment until we reassign the existing entry to
the current skb.

A 0 refcount can't happen while the nf_conn object is still in the
hash table and parallel mutations are impossible because we hold the
bucket lock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# bc924704 25-Aug-2020 Florian Westphal <fw@strlen.de>

netfilter: conntrack: add clash resolution stat counter

There is a misconception about what "insert_failed" means.

We increment this even when a clash got resolved, so it might not indicate
a problem.

Add a dedicated counter for clash resolution and only increment
insert_failed if a clash cannot be resolved.

For the old /proc interface, export this in place of an older stat
that got removed a while back.
For ctnetlink, export this with a new attribute.

Also correct an outdated comment that implies we add a duplicate tuple --
we only add the (unique) reply direction.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4afc41df 25-Aug-2020 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove ignore stats

This counter increments when nf_conntrack_in sees a packet that already
has a conntrack attached or when the packet is marked as UNTRACKED.
Neither is an error.

The former is normal for loopback traffic. The second happens for
certain ICMPv6 packets or when nftables/ip(6)tables rules are in place.

In case someone needs to count UNTRACKED packets, or packets
that are marked as untracked before conntrack_in this can be done with
both nftables and ip(6)tables rules.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b1328e54 25-Aug-2020 Florian Westphal <fw@strlen.de>

netfilter: conntrack: do not increment two error counters at same time

The /proc interface for nf_conntrack displays the "error" counter as
"icmp_error".

It makes sense to not increment "invalid" when failing to handle an icmp
packet since those are special.

For example, its possible for conntrack to see partial and/or fragmented
packets inside icmp errors. This should be a separate event and not get
mixed with the "invalid" counter.

Likewise, remove the "error" increment for errors from get_l4proto().
After this, the error counter will only increment for errors coming from
icmp(v6) packet handling.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 73f9407b 03-Aug-2020 Roi Dayan <roid@mellanox.com>

netfilter: conntrack: Move nf_ct_offload_timeout to header file

To be used by callers from other modules.

[ Rename DAY to NF_CT_DAY to avoid possible symbol name pollution
issue --Pablo ]

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8201d923 20-Jul-2020 Ahmed S. Darwish <a.darwish@linutronix.de>

netfilter: conntrack: Use sequence counter with associated spinlock

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-15-a.darwish@linutronix.de


# 3db86c39 12-Jul-2020 Andrew Lunn <andrew@lunn.ch>

net: netfilter: kerneldoc fixes

Simple fixes which require no deep knowledge of the code.

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d005fbb8 01-Jul-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: refetch conntrack after nf_conntrack_update()

__nf_conntrack_update() might refresh the conntrack object that is
attached to the skbuff. Otherwise, this triggers UAF.

[ 633.200434] ==================================================================
[ 633.200472] BUG: KASAN: use-after-free in nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[ 633.200478] Read of size 1 at addr ffff888370804c00 by task nfqnl_test/6769

[ 633.200487] CPU: 1 PID: 6769 Comm: nfqnl_test Not tainted 5.8.0-rc2+ #388
[ 633.200490] Hardware name: LENOVO 23259H1/23259H1, BIOS G2ET32WW (1.12 ) 05/30/2012
[ 633.200491] Call Trace:
[ 633.200499] dump_stack+0x7c/0xb0
[ 633.200526] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[ 633.200532] print_address_description.constprop.6+0x1a/0x200
[ 633.200539] ? _raw_write_lock_irqsave+0xc0/0xc0
[ 633.200568] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[ 633.200594] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[ 633.200598] kasan_report.cold.9+0x1f/0x42
[ 633.200604] ? call_rcu+0x2c0/0x390
[ 633.200633] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[ 633.200659] nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[ 633.200687] ? nf_conntrack_find_get+0x30/0x30 [nf_conntrack]

Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1436
Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cb8aa9a3 04-May-2020 Romain Bellan <romain.bellan@wifirst.fr>

netfilter: ctnetlink: add kernel side filtering for dump

Conntrack dump does not support kernel side filtering (only get exists,
but it returns only one entry. And user has to give a full valid tuple)

It means that userspace has to implement filtering after receiving many
irrelevant entries, consuming resources (conntrack table is sometimes
very huge, much more than a routing table for example).

This patch adds filtering in kernel side. To achieve this goal, we:

* Add a new CTA_FILTER netlink attributes, actually a flag list to
parametize filtering
* Convert some *nlattr_to_tuple() functions, to allow a partial parsing
of CTA_TUPLE_ORIG and CTA_TUPLE_REPLY (so nf_conntrack_tuple it not
fully set)

Filtering is now possible on:
* IP SRC/DST values
* Ports for TCP and UDP flows
* IMCP(v6) codes types and IDs

Filtering is done as an "AND" operator. For example, when flags
PROTO_SRC_PORT, PROTO_NUM and IP_SRC are sets, only entries matching all
values are dumped.

Changes since v1:
Set NLM_F_DUMP_FILTERED in nlm flags if entries are filtered

Changes since v2:
Move several constants to nf_internals.h
Move a fix on netlink values check in a separate patch
Add a check on not-supported flags
Return EOPNOTSUPP if CDA_FILTER is set in ctnetlink_flush_conntrack
(not yet implemented)
Code style issues

Changes since v3:
Fix compilation warning reported by kbuild test robot

Changes since v4:
Fix a regression introduced in v3 (returned EINVAL for valid netlink
messages without CTA_MARK)

Changes since v5:
Change definition of CTA_FILTER_F_ALL
Fix a regression when CTA_TUPLE_ZONE is not set

Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 94945ad2 26-May-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: comparison of unsigned in cthelper confirmation

net/netfilter/nf_conntrack_core.c: In function nf_confirm_cthelper:
net/netfilter/nf_conntrack_core.c:2117:15: warning: comparison of unsigned expression in < 0 is always false [-Wtype-limits]
2117 | if (protoff < 0 || (frag_off & htons(~0x7)) != 0)
| ^

ipv6_skip_exthdr() returns a signed integer.

Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: 703acd70f249 ("netfilter: nfnetlink_cthelper: unbreak userspace helper support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 46c1e062 27-May-2020 Nathan Chancellor <nathan@kernel.org>

netfilter: conntrack: Pass value of ctinfo to __nf_conntrack_update

Clang warns:

net/netfilter/nf_conntrack_core.c:2068:21: warning: variable 'ctinfo' is
uninitialized when used here [-Wuninitialized]
nf_ct_set(skb, ct, ctinfo);
^~~~~~
net/netfilter/nf_conntrack_core.c:2024:2: note: variable 'ctinfo' is
declared here
enum ip_conntrack_info ctinfo;
^
1 warning generated.

nf_conntrack_update was split up into nf_conntrack_update and
__nf_conntrack_update, where the assignment of ctinfo is in
nf_conntrack_update but it is used in __nf_conntrack_update.

Pass the value of ctinfo from nf_conntrack_update to
__nf_conntrack_update so that uninitialized memory is not used
and everything works properly.

Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again")
Link: https://github.com/ClangBuiltLinux/linux/issues/1039
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ee04805f 24-May-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: make conntrack userspace helpers work again

Florian Westphal says:

"Problem is that after the helper hook was merged back into the confirm
one, the queueing itself occurs from the confirm hook, i.e. we queue
from the last netfilter callback in the hook-list.

Therefore, on return, the packet bypasses the confirm action and the
connection is never committed to the main conntrack table.

To fix this there are several ways:
1. revert the 'Fixes' commit and have a extra helper hook again.
Works, but has the drawback of adding another indirect call for
everyone.

2. Special case this: split the hooks only when userspace helper
gets added, so queueing occurs at a lower priority again,
and normal enqueue reinject would eventually call the last hook.

3. Extend the existing nf_queue ct update hook to allow a forced
confirmation (plus run the seqadj code).

This goes for 3)."

Fixes: 827318feb69cb ("netfilter: conntrack: remove helper hook again")
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 54ab49fd 10-May-2020 Florian Westphal <fw@strlen.de>

netfilter: conntrack: fix infinite loop on rmmod

'rmmod nf_conntrack' can hang forever, because the netns exit
gets stuck in nf_conntrack_cleanup_net_list():

i_see_dead_people:
busy = 0;
list_for_each_entry(net, net_exit_list, exit_list) {
nf_ct_iterate_cleanup(kill_all, net, 0, 0);
if (atomic_read(&net->ct.count) != 0)
busy = 1;
}
if (busy) {
schedule();
goto i_see_dead_people;
}

When nf_ct_iterate_cleanup iterates the conntrack table, all nf_conn
structures can be found twice:
once for the original tuple and once for the conntracks reply tuple.

get_next_corpse() only calls the iterator when the entry is
in original direction -- the idea was to avoid unneeded invocations
of the iterator callback.

When support for clashing entries was added, the assumption that
all nf_conn objects are added twice, once in original, once for reply
tuple no longer holds -- NF_CLASH_BIT entries are only added in
the non-clashing reply direction.

Thus, if at least one NF_CLASH entry is in the list then
nf_conntrack_cleanup_net_list() always skips it completely.

During normal netns destruction, this causes a hang of several
seconds, until the gc worker removes the entry (NF_CLASH entries
always have a 1 second timeout).

But in the rmmod case, the gc worker has already been stopped, so
ct.count never becomes 0.

We can fix this in two ways:

1. Add a second test for CLASH_BIT and call iterator for those
entries as well, or:
2. Skip the original tuple direction and use the reply tuple.

2) is simpler, so do that.

Fixes: 6a757c07e51f80ac ("netfilter: conntrack: allow insertion of clashing entries")
Reported-by: Chen Yi <yiche@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2c407aca 30-Apr-2020 Arnd Bergmann <arnd@arndb.de>

netfilter: conntrack: avoid gcc-10 zero-length-bounds warning

gcc-10 warns around a suspicious access to an empty struct member:

net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_alloc':
net/netfilter/nf_conntrack_core.c:1522:9: warning: array subscript 0 is outside the bounds of an interior zero-length array 'u8[0]' {aka 'unsigned char[0]'} [-Wzero-length-bounds]
1522 | memset(&ct->__nfct_init_offset[0], 0,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from net/netfilter/nf_conntrack_core.c:37:
include/net/netfilter/nf_conntrack.h:90:5: note: while referencing '__nfct_init_offset'
90 | u8 __nfct_init_offset[0];
| ^~~~~~~~~~~~~~~~~~

The code is correct but a bit unusual. Rework it slightly in a way that
does not trigger the warning, using an empty struct instead of an empty
array. There are probably more elegant ways to do this, but this is the
smallest change.

Fixes: c41884ce0562 ("netfilter: conntrack: avoid zeroing timer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9312eaba 27-Mar-2020 wenxu <wenxu@ucloud.cn>

netfilter: conntrack: add nf_ct_acct_add()

Add nf_ct_acct_add function to update the conntrack counter
with packets and bytes.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8ac2bd35 23-Mar-2020 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: export nf_ct_acct_update()

This function allows you to update the conntrack counters.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6b36d482 10-Mar-2020 Jules Irenge <jbi.octave@gmail.com>

netfilter: conntrack: Add missing annotations for nf_conntrack_all_lock() and nf_conntrack_all_unlock()

Sparse reports warnings at nf_conntrack_all_lock()
and nf_conntrack_all_unlock()

warning: context imbalance in nf_conntrack_all_lock()
- wrong count at exit
warning: context imbalance in nf_conntrack_all_unlock()
- unexpected unlock

Add the missing __acquires(&nf_conntrack_locks_all_lock)
Add missing __releases(&nf_conntrack_locks_all_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9325f070 20-Feb-2020 Li RongQing <lirongqing@baidu.com>

netfilter: cleanup unused macro

TEMPLATE_NULLS_VAL is not used after commit 0838aa7fcfcd
("netfilter: fix netns dependencies with conntrack templates")

PFX is not used after commit 8bee4bad03c5b ("netfilter: xt
extensions: use pr_<level>")

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6a757c07 03-Feb-2020 Florian Westphal <fw@strlen.de>

netfilter: conntrack: allow insertion of clashing entries

This patch further relaxes the need to drop an skb due to a clash with
an existing conntrack entry.

Current clash resolution handles the case where the clash occurs between
two identical entries (distinct nf_conn objects with same tuples), i.e.:

Original Reply
existing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.6:5353
clashing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.6:5353

... existing handling will discard the unconfirmed clashing entry and
makes skb->_nfct point to the existing one. The skb can then be
processed normally just as if the clash would not have existed in the
first place.

For other clashes, the skb needs to be dropped.
This frequently happens with DNS resolvers that send A and AAAA queries
back-to-back when NAT rules are present that cause packets to get
different DNAT transformations applied, for example:

-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353
-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353

In this case the A or AAAA query is dropped which incurs a costly
delay during name resolution.

This patch also allows this collision type:
Original Reply
existing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.6:5353
clashing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.7:5353

In this case, clash is in original direction -- the reply direction
is still unique.

The change makes it so that when the 2nd colliding packet is received,
the clashing conntrack is tagged with new IPS_NAT_CLASH_BIT, gets a fixed
1 second timeout and is inserted in the reply direction only.

The entry is hidden from 'conntrack -L', it will time out quickly
and it can be early dropped because it will never progress to the
ASSURED state.

To avoid special-casing the delete code path to special case
the ORIGINAL hlist_nulls node, a new helper, "hlist_nulls_add_fake", is
added so hlist_nulls_del() will work.

Example:

CPU A: CPU B:
1. 10.2.3.4:42 -> 10.8.8.8:53 (A)
2. 10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
3. Apply DNAT, reply changed to 10.0.0.6
4. 10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
5. Apply DNAT, reply changed to 10.0.0.7
6. confirm/commit to conntrack table, no collisions
7. commit clashing entry

Reply comes in:

10.2.3.4:42 <- 10.0.0.6:5353 (A)
-> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
10.2.3.4:42 <- 10.0.0.7:5353 (AAAA)
-> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
The conntrack entry is deleted from table, as it has the NAT_CLASH
bit set.

In case of a retransmit from ORIGINAL dir, all further packets will get
the DNAT transformation to 10.0.0.6.

I tried to come up with other solutions but they all have worse
problems.

Alternatives considered were:
1. Confirm ct entries at allocation time, not in postrouting.
a. will cause uneccesarry work when the skb that creates the
conntrack is dropped by ruleset.
b. in case nat is applied, ct entry would need to be moved in
the table, which requires another spinlock pair to be taken.
c. breaks the 'unconfirmed entry is private to cpu' assumption:
we would need to guard all nfct->ext allocation requests with
ct->lock spinlock.

2. Make the unconfirmed list a hash table instead of a pcpu list.
Shares drawback c) of the first alternative.

3. Document this is expected and force users to rearrange their
ruleset (e.g. by using "-m cluster" instead of "-m statistics").
nft has the 'jhash' expression which can be used instead of 'numgen'.

Major drawback: doesn't fix what I consider a bug, not very realistic
and I believe its reasonable to have the existing rulesets to 'just
work'.

4. Document this is expected and force users to steer problematic
packets to the same CPU -- this would serialize the "allocate new
conntrack entry/nat table evaluation/perform nat/confirm entry", so
no race can occur. Similar drawback to 3.

Another advantage of this patch compared to 1) and 2) is that there are
no changes to the hot path; things are handled in the udp tracker and
the clash resolution path.

Cc: rcu@vger.kernel.org
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# bb89abe5 03-Feb-2020 Florian Westphal <fw@strlen.de>

netfilter: conntrack: split resolve_clash function

Followup patch will need a helper function with the 'clashing entries
refer to the identical tuple in both directions' resolution logic.

This patch will add another resolve_clash helper where loser_ct must
not be added to the dying list because it will be inserted into the
table.

Therefore this also moves the stat counters and dying-list insertion
of the losing ct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b1b32552 03-Feb-2020 Florian Westphal <fw@strlen.de>

netfilter: conntrack: place confirm-bit setting in a helper

... so it can be re-used from clash resolution in followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3d1e0b40 03-Feb-2020 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove two args from resolve_clash

ctinfo is whats taken from the skb, i.e.
ct = nf_ct_get(skb, &ctinfo).

We do not pass 'ct' and instead re-fetch it from the skb.
Just do the same for both netns and ctinfo.

Also add a comment on what clash resolution is supposed to do.
While at it, one indent level can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b9e0102a 28-Jan-2020 Joe Perches <joe@perches.com>

netfilter: Use kvcalloc

Convert the uses of kvmalloc_array with __GFP_ZERO to
the equivalent kvcalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 13d74c0a 12-Dec-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove two export symbols

Not used anywhere, remove them.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c7c17e6a 28-Nov-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: tell compiler to not inline nf_ct_resolve_clash

At this time compiler inlines it, but this code will not be executed
under normal conditions.

Also, no inlining allows to use "nf_ct_resolve_clash%return" perf probe.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2ad9d774 15-Oct-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: free extension area immediately

Instead of waiting for rcu grace period just free it directly.

This is safe because conntrack lookup doesn't consider extensions.

Other accesses happen while ct->ext can't be free'd, either because
a ct refcount was taken or because the conntrack hash bucket lock or
the dying list spinlock have been taken.

This allows to remove __krealloc in a followup patch, netfilter was the
only user.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e37542ba 09-Oct-2019 Eric Dumazet <edumazet@google.com>

netfilter: conntrack: avoid possible false sharing

As hinted by KCSAN, we need at least one READ_ONCE()
to prevent a compiler optimization.

More details on :
https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE#it-may-improve-performance

sysbot report :
BUG: KCSAN: data-race in __nf_ct_refresh_acct / __nf_ct_refresh_acct

read to 0xffff888123eb4f08 of 4 bytes by interrupt on cpu 0:
__nf_ct_refresh_acct+0xd4/0x1b0 net/netfilter/nf_conntrack_core.c:1796
nf_ct_refresh_acct include/net/netfilter/nf_conntrack.h:201 [inline]
nf_conntrack_tcp_packet+0xd40/0x3390 net/netfilter/nf_conntrack_proto_tcp.c:1161
nf_conntrack_handle_packet net/netfilter/nf_conntrack_core.c:1633 [inline]
nf_conntrack_in+0x410/0xaa0 net/netfilter/nf_conntrack_core.c:1727
ipv4_conntrack_in+0x27/0x40 net/netfilter/nf_conntrack_proto.c:178
nf_hook_entry_hookfn include/linux/netfilter.h:135 [inline]
nf_hook_slow+0x83/0x160 net/netfilter/core.c:512
nf_hook include/linux/netfilter.h:260 [inline]
NF_HOOK include/linux/netfilter.h:303 [inline]
ip_rcv+0x12f/0x1a0 net/ipv4/ip_input.c:523
__netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
__netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
napi_skb_finish net/core/dev.c:5671 [inline]
napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
virtnet_receive drivers/net/virtio_net.c:1323 [inline]
virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
napi_poll net/core/dev.c:6352 [inline]
net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
__do_softirq+0x115/0x33f kernel/softirq.c:292

write to 0xffff888123eb4f08 of 4 bytes by task 7191 on cpu 1:
__nf_ct_refresh_acct+0xfb/0x1b0 net/netfilter/nf_conntrack_core.c:1797
nf_ct_refresh_acct include/net/netfilter/nf_conntrack.h:201 [inline]
nf_conntrack_tcp_packet+0xd40/0x3390 net/netfilter/nf_conntrack_proto_tcp.c:1161
nf_conntrack_handle_packet net/netfilter/nf_conntrack_core.c:1633 [inline]
nf_conntrack_in+0x410/0xaa0 net/netfilter/nf_conntrack_core.c:1727
ipv4_conntrack_local+0xbe/0x130 net/netfilter/nf_conntrack_proto.c:200
nf_hook_entry_hookfn include/linux/netfilter.h:135 [inline]
nf_hook_slow+0x83/0x160 net/netfilter/core.c:512
nf_hook include/linux/netfilter.h:260 [inline]
__ip_local_out+0x1f7/0x2b0 net/ipv4/ip_output.c:114
ip_local_out+0x31/0x90 net/ipv4/ip_output.c:123
__ip_queue_xmit+0x3a8/0xa40 net/ipv4/ip_output.c:532
ip_queue_xmit+0x45/0x60 include/net/ip.h:236
__tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
__tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7191 Comm: syz-fuzzer Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: cc16921351d8 ("netfilter: conntrack: avoid same-timeout update")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>


# 44b63b0a 23-Aug-2019 Li RongQing <lirongqing@baidu.com>

netfilter: not mark a spinlock as __read_mostly

when spinlock is locked/unlocked, its elements will be changed,
so marking it as __read_mostly is not suitable.

and remove a duplicate definition of nf_conntrack_locks_all_lock
strange that compiler does not complain.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 656c8e9c 08-Aug-2019 Dirk Morris <dmorris@metaloft.com>

netfilter: conntrack: Use consistent ct id hash calculation

Change ct id hash calculation to only use invariants.

Currently the ct id hash calculation is based on some fields that can
change in the lifetime on a conntrack entry in some corner cases. The
current hash uses the whole tuple which contains an hlist pointer which
will change when the conntrack is placed on the dying list resulting in
a ct id change.

This patch also removes the reply-side tuple and extension pointer from
the hash calculation so that the ct id will will not change from
initialization until confirmation.

Fixes: 3c79107631db1f7 ("netfilter: ctnetlink: don't use conntrack/expect object addresses as id")
Signed-off-by: Dirk Morris <dmorris@metaloft.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 05ba4c89 08-Jul-2019 Yonatan Goldschmidt <yon.goldschmidt@gmail.com>

netfilter: Update obsolete comments referring to ip_conntrack

In 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") the new
generic nf_conntrack was introduced, and it came to supersede the old
ip_conntrack.

This change updates (some) of the obsolete comments referring to old
file/function names of the ip_conntrack mechanism, as well as removes a
few self-referencing comments that we shouldn't maintain anymore.

I did not update any comments referring to historical actions (e.g,
comments like "this file was derived from ..." were left untouched, even
if the referenced file is no longer here).

Signed-off-by: Yonatan Goldschmidt <yon.goldschmidt@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 87e389b4 04-Jun-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: small conntrack lookup optimization

____nf_conntrack_find() performs checks on the conntrack objects in
this order:

1. if (nf_ct_is_expired(ct))

This fetches ct->timeout, in third cache line.

The hnnode that is used to store the list pointers resides in the first
(origin) or second (reply tuple) cache lines.

This test rarely passes, but its necessary to reap obsolete entries.

2. if (nf_ct_is_dying(ct))

This fetches ct->status, also in third cache line.

The test is useless, and can be removed:
Consider:
cpu0 cpu1
ct = ____nf_conntrack_find()
atomic_inc_not_zero(ct) -> ok
nf_ct_key_equal -> ok
is_dying -> DYING bit not set, ok
set_bit(ct, DYING);
... unhash ... etc.
return ct
-> returning a ct with dying bit set, despite
having a test for it.

This (unlikely) case is fine - refcount prevents ct from getting free'd.

3. if (nf_ct_key_equal(h, tuple, zone, net))

nf_ct_key_equal checks in following order:

1. Tuple equal (first or second cacheline)
2. Zone equal (third cacheline)
3. confirmed bit set (->status, third cacheline)
4. net namespace match (third cacheline).

Swapping "timeout" and "cpu" places timeout in the first cacheline.
This has two advantages:

1. For a conntrack that won't even match the original tuple,
we will now only fetch the first and maybe the second cacheline
instead of always accessing the 3rd one as well.

2. in case of TCP ct->timeout changes frequently because we
reduce/increase it when there are packets outstanding in the network.

The first cacheline contains both the reference count and the ct spinlock,
i.e. moving timeout there avoids writes to 3rd cacheline.

The restart sequence in __nf_conntrack_find() is removed, if we found a
candidate, but then fail to increment the refcount or discover the tuple
has changed (object recycling), just pretend we did not find an entry.

A second lookup won't find anything until another CPU adds a new conntrack
with identical tuple into the hash table, which is very unlikely.

We have the confirmation-time checks (when we hold hash lock) that deal
with identical entries and even perform clash resolution in some cases.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 916f6efa 16-Apr-2019 Florian Westphal <fw@strlen.de>

netfilter: never get/set skb->tstamp

setting net.netfilter.nf_conntrack_timestamp=1 breaks xmit with fq
scheduler. skb->tstamp might be "refreshed" using ktime_get_real(),
but fq expects CLOCK_MONOTONIC.

This patch removes all places in netfilter that check/set skb->tstamp:

1. To fix the bogus "start" time seen with conntrack timestamping for
outgoing packets, never use skb->tstamp and always use current time.
2. In nfqueue and nflog, only use skb->tstamp for incoming packets,
as determined by current hook (prerouting, input, forward).
3. xt_time has to use system clock as well rather than skb->tstamp.
We could still use skb->tstamp for prerouting/input/foward, but
I see no advantage to make this conditional.

Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3c791076 01-Apr-2019 Florian Westphal <fw@strlen.de>

netfilter: ctnetlink: don't use conntrack/expect object addresses as id

else, we leak the addresses to userspace via ctnetlink events
and dumps.

Compute an ID on demand based on the immutable parts of nf_conn struct.

Another advantage compared to using an address is that there is no
immediate re-use of the same ID in case the conntrack entry is freed and
reallocated again immediately.

Fixes: 3583240249ef ("[NETFILTER]: nf_conntrack_expect: kill unique ID")
Fixes: 7f85f914721f ("[NETFILTER]: nf_conntrack: kill unique ID")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8176c833 28-Mar-2019 Alexander Potapenko <glider@google.com>

netfilter: conntrack: initialize ct->timeout

KMSAN started reporting an error when accessing ct->timeout for the
first time without initialization:

BUG: KMSAN: uninit-value in __nf_ct_refresh_acct+0x1ae/0x470 net/netfilter/nf_conntrack_core.c:1765
...
dump_stack+0x173/0x1d0 lib/dump_stack.c:113
kmsan_report+0x131/0x2a0 mm/kmsan/kmsan.c:624
__msan_warning+0x7a/0xf0 mm/kmsan/kmsan_instr.c:310
__nf_ct_refresh_acct+0x1ae/0x470 net/netfilter/nf_conntrack_core.c:1765
nf_ct_refresh_acct ./include/net/netfilter/nf_conntrack.h:201
nf_conntrack_udp_packet+0xb44/0x1040 net/netfilter/nf_conntrack_proto_udp.c:122
nf_conntrack_handle_packet net/netfilter/nf_conntrack_core.c:1605
nf_conntrack_in+0x1250/0x26c9 net/netfilter/nf_conntrack_core.c:1696
...
Uninit was created at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205
kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159
kmsan_kmalloc+0xa9/0x130 mm/kmsan/kmsan_hooks.c:173
kmem_cache_alloc+0x554/0xb10 mm/slub.c:2789
__nf_conntrack_alloc+0x16f/0x690 net/netfilter/nf_conntrack_core.c:1342
init_conntrack+0x6cb/0x2490 net/netfilter/nf_conntrack_core.c:1421

Signed-off-by: Alexander Potapenko <glider@google.com>
Fixes: cc16921351d8ba1 ("netfilter: conntrack: avoid same-timeout update")
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2e7b162c 26-Feb-2019 Li RongQing <lirongqing@baidu.com>

netfilter: nf_conntrack: ensure that CONNTRACK_LOCKS is power of 2

CONNTRACK_LOCKS is divisor when computer array index, if it is power of
2, compiler will optimize modulo operation as bitwise AND, or else
modulo will lower performance.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cc169213 21-Feb-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: avoid same-timeout update

No need to dirty a cache line if timeout is unchanged.
Also, WARN() is useless here: we crash on 'skb->len' access
if skb is NULL.

Last, ct->timeout is u32, not 'unsigned long' so adapt the
function prototype accordingly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d2c5c103 19-Feb-2019 Florian Westphal <fw@strlen.de>

netfilter: nat: remove nf_nat_l3proto.h and nf_nat_core.h

The l3proto name is gone, its header file is the last trace.
While at it, also remove nf_nat_core.h, its very small and all users
include nf_nat.h too.

before:
text data bss dec hex filename
22948 1612 4136 28696 7018 nf_nat.ko

after removal of l3proto register/unregister functions:
text data bss dec hex filename
22196 1516 4136 27848 6cc8 nf_nat.ko

checkpatch complains about overly long lines, but line breaks
do not make things more readable and the line length gets smaller
here, not larger.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 13f5251f 11-Feb-2019 Chieh-Min Wang <chiehminw@synology.com>

netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm

For bridge(br_flood) or broadcast/multicast packets, they could clone
skb with unconfirmed conntrack which break the rule that unconfirmed
skb->_nfct is never shared. With nfqueue running on my system, the race
can be easily reproduced with following warning calltrace:

[13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P W 4.4.60 #7744
[13257.707568] Hardware name: Qualcomm (Flattened Device Tree)
[13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14)
[13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8)
[13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0)
[13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24)
[13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618)
[13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc)
[13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c)
[13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4)
[13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac)
[13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170)
[13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420)
[13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248)
[13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0)
[13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c)
[13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368)
[13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44)
[13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200)
[13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64)
[13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34)

The original code just triggered the warning but do nothing. It will
caused the shared conntrack moves to the dying list and the packet be
droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack).

- Reproduce steps:

+----------------------------+
| br0(bridge) |
| |
+-+---------+---------+------+
| eth0| | eth1| | eth2|
| | | | | |
+--+--+ +--+--+ +---+-+
| | |
| | |
+--+-+ +-+--+ +--+-+
| PC1| | PC2| | PC3|
+----+ +----+ +----+

iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass

ps: Our nfq userspace program will set mark on packets whose connection
has already been processed.

PC1 sends broadcast packets simulated by hping3:

hping3 --rand-source --udp 192.168.1.255 -i u100

- Broadcast racing flow chart is as follow:

br_handle_frame
BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish)
// skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage
br_handle_frame_finish
// check if this packet is broadcast
br_flood_forward
br_flood
list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port
maybe_deliver
deliver_clone
skb = skb_clone(skb)
__br_forward
BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...)
// queue in our nfq and received by our userspace program
// goto __nf_conntrack_confirm with process context on CPU 1
br_pass_frame_up
BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...)
// goto __nf_conntrack_confirm with softirq context on CPU 0

Because conntrack confirm can happen at both INPUT and POSTROUTING
stage. So with NFQUEUE running, skb->_nfct with the same unconfirmed
conntrack could race on different core.

This patch fixes a repeating kernel splat, now it is only displayed
once.

Signed-off-by: Chieh-Min Wang <chiehminw@synology.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4e35c1cb 29-Jan-2019 Martynas Pumputis <martynas@weave.works>

netfilter: nf_nat: skip nat clash resolution for same-origin entries

It is possible that two concurrent packets originating from the same
socket of a connection-less protocol (e.g. UDP) can end up having
different IP_CT_DIR_REPLY tuples which results in one of the packets
being dropped.

To illustrate this, consider the following simplified scenario:

1. Packet A and B are sent at the same time from two different threads
by same UDP socket. No matching conntrack entry exists yet.
Both packets cause allocation of a new conntrack entry.
2. get_unique_tuple gets called for A. No clashing entry found.
conntrack entry for A is added to main conntrack table.
3. get_unique_tuple is called for B and will find that the reply
tuple of B is already taken by A.
It will allocate a new UDP source port for B to resolve the clash.
4. conntrack entry for B cannot be added to main conntrack table
because its ORIGINAL direction is clashing with A and the REPLY
directions of A and B are not the same anymore due to UDP source
port reallocation done in step 3.

This patch modifies nf_conntrack_tuple_taken so it doesn't consider
colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal.

[ Florian: simplify patch to not use .allow_clash setting
and always ignore identical flows ]

Signed-off-by: Martynas Pumputis <martynas@weave.works>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e2f7cc72 21-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: fix bogus port values for other l4 protocols

We must only extract l4 proto information if we can track the layer 4
protocol.

Before removal of pkt_to_tuple callback, the code to extract port
information was only reached for TCP/UDP/LITE/DCCP/SCTP.

The other protocols were handled by the indirect call, and the
'generic' tracker took care of other protocols that have no notion
of 'ports'.

After removal of the callback we must be more strict here and only
init port numbers for those protocols that have ports.

Fixes: df5e1629087a ("netfilter: conntrack: remove pkt_to_tuple callback")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 81e01647 21-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: fix IPV6=n builds

Stephen Rothwell reports:
After merging the netfilter-next tree, today's linux-next build
(powerpc ppc64_defconfig) failed like this:

ERROR: "nf_conntrack_invert_icmpv6_tuple" [nf_conntrack.ko] undefined!
ERROR: "nf_conntrack_icmpv6_packet" [nf_conntrack.ko] undefined!
ERROR: "nf_conntrack_icmpv6_init_net" [nf_conntrack.ko] undefined!
ERROR: "icmpv6_pkt_to_tuple" [nf_conntrack.ko] undefined!
ERROR: "nf_ct_gre_keymap_destroy" [nf_conntrack.ko] undefined!

icmpv6 related errors are due to lack of IS_ENABLED(CONFIG_IPV6) (no
icmpv6 support is builtin if kernel has CONFIG_IPV6=n), the
nf_ct_gre_keymap_destroy error is due to lack of PROTO_GRE check.

Fixes: a47c54048162 ("netfilter: conntrack: handle builtin l4proto packet functions via direct calls")
Fixes: e2e48b471634 ("netfilter: conntrack: handle icmp pkt_to_tuple helper via direct calls")
Fixes: 197c4300aec0 ("netfilter: conntrack: remove invert_tuple callback")
Fixes: 2a389de86e4a ("netfilter: conntrack: remove l4proto init and get_net callbacks")
Fixes: e56894356f60 ("netfilter: conntrack: remove l4proto destroy hook")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4a60dc74 15-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove nf_ct_l4proto_find_get

Its now same as __nf_ct_l4proto_find(), so rename that to
nf_ct_l4proto_find and use it everywhere.

It never returns NULL and doesn't need locks or reference counts.

Before this series:
302824 net/netfilter/nf_conntrack.ko
21504 net/netfilter/nf_conntrack_proto_gre.ko

text data bss dec hex filename
6281 1732 4 8017 1f51 nf_conntrack_proto_gre.ko
108356 20613 236 129205 1f8b5 nf_conntrack.ko

After:
294864 net/netfilter/nf_conntrack.ko
text data bss dec hex filename
106979 19557 240 126776 1ef38 nf_conntrack.ko

so, even with builtin gre, total size got reduced.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e5689435 15-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove l4proto destroy hook

Only one user (gre), add a direct call and remove this facility.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 303e0c55 15-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: avoid unneeded nf_conntrack_l4proto lookups

after removal of the packet and invert function pointers, several
places do not need to lookup the l4proto structure anymore.

Remove those lookups.
The function nf_ct_invert_tuplepr becomes redundant, replace
it with nf_ct_invert_tuple everywhere.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 44fb87f6 15-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove remaining l4proto indirect packet calls

Now that all l4trackers are builtin, no need to use a mix of direct and
indirect calls.
This removes the last two users: gre and the generic l4 protocol
tracker.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 197c4300 15-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove invert_tuple callback

Only used by icmp(v6). Prefer a direct call and remove this
function from the l4proto struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# df5e1629 15-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove pkt_to_tuple callback

GRE is now builtin, so we can handle it via direct call and
remove the callback.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e2e48b47 15-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: handle icmp pkt_to_tuple helper via direct calls

rather than handling them via indirect call, use a direct one instead.
This leaves GRE as the last user of this indirect call facility.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a47c5404 15-Jan-2019 Florian Westphal <fw@strlen.de>

netfilter: conntrack: handle builtin l4proto packet functions via direct calls

The l4 protocol trackers are invoked via indirect call: l4proto->packet().

With one exception (gre), all l4trackers are builtin, so we can make
.packet optional and use a direct call for most protocols.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ca79b0c2 28-Dec-2018 Arun KS <arunks@codeaurora.org>

mm: convert totalram_pages and totalhigh_pages variables to atomic

totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d6357de 28-Dec-2018 Arun KS <arunks@codeaurora.org>

mm: reference totalram_pages and managed_pages once per function

Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.

This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.

totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better
to remove the lock and convert variables to atomic. With the change,
preventing poteintial store-to-read tearing comes as a bonus.

This patch (of 4):

This is in preparation to a later patch which converts totalram_pages and
zone->managed_pages to atomic variables. Please note that re-reading the
value might lead to a different value and as such it could lead to
unexpected behavior. There are no known bugs as a result of the current
code but it is better to prevent from them in principle.

Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fc3893fd 18-Dec-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove empty pernet fini stubs

after moving sysctl handling into single place, the init functions
can't fail anymore and some of the fini functions are empty.

Remove them and change return type to void.
This also simplifies error unwinding in conntrack module init path.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f393808d 25-Oct-2018 Vasily Khoruzhick <vasilykh@arista.com>

netfilter: conntrack: fix calculation of next bucket number in early_drop

If there's no entry to drop in bucket that corresponds to the hash,
early_drop() should look for it in other buckets. But since it increments
hash instead of bucket number, it actually looks in the same bucket 8
times: hsize is 16k by default (14 bits) and hash is 32-bit value, so
reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
most cases.

Fix it by increasing bucket number instead of hash and rename _hash
to bucket to avoid future confusion.

Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic")
Cc: <stable@vger.kernel.org> # v4.7+
Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# dd2934a9 16-Sep-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove l3->l4 mapping information

l4 protocols are demuxed by l3num, l4num pair.

However, almost all l4 trackers are l3 agnostic.

Only exceptions are:
- gre, icmp (ipv4 only)
- icmpv6 (ipv6 only)

This commit gets rid of the l3 mapping, l4 trackers can now be looked up
by their IPPROTO_XXX value alone, which gets rid of the additional l3
indirection.

For icmp, ipcmp6 and gre, add a check on state->pf and
return -NF_ACCEPT in case we're asked to track e.g. icmpv6-in-ipv4,
this seems more fitting than using the generic tracker.

Additionally we can kill the 2nd l4proto definitions that were needed
for v4/v6 split -- they are now the same so we can use single l4proto
struct for each protocol, rather than two.

The EXPORT_SYMBOLs can be removed as all these object files are
part of nf_conntrack with no external references.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6fe78fa4 12-Sep-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove error callback and handle icmp from core

icmp(v6) are the only two layer four protocols that need the error()
callback (to handle icmp errors that are related to an established
connections, e.g. packet too big, port unreachable and the like).

Remove the error callback and handle these two special cases from the core.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9976fc6e 12-Sep-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove the l4proto->new() function

->new() gets invoked after ->error() and before ->packet() if
a conntrack lookup has found no result for the tuple.

We can fold it into ->packet() -- the packet() implementations
can check if the conntrack is confirmed (new) or not
(already in hash).

If its unconfirmed, the conntrack isn't in the hash yet so current
skb created a new conntrack entry.

Only relevant side effect -- if packet() doesn't return NF_ACCEPT
but -NF_ACCEPT (or drop), while the conntrack was just created,
then the newly allocated conntrack is freed right away, rather than not
created in the first place.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 93e66024 12-Sep-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: pass nf_hook_state to packet and error handlers

nf_hook_state contains all the hook meta-information: netns, protocol family,
hook location, and so on.

Instead of only passing selected information, pass a pointer to entire
structure.

This will allow to merge the error and the packet handlers and remove
the ->new() function in followup patches.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 285189c7 25-Jul-2018 Li RongQing <lirongqing@baidu.com>

netfilter: use kvmalloc_array to allocate memory for hashtable

nf_ct_alloc_hashtable is used to allocate memory for conntrack,
NAT bysrc and expectation hashtable. Assuming 64k bucket size,
which means 7th order page allocation, __get_free_pages, called
by nf_ct_alloc_hashtable, will trigger the direct memory reclaim
and stall for a long time, when system has lots of memory stress

so replace combination of __get_free_pages and vzalloc with
kvmalloc_array, which provides a overflow check and a fallback
if no high order memory is available, and do not retry to reclaim
memory, reduce stall

and remove nf_ct_free_hashtable, since it is just a kvfree

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Wang Li <wangli39@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 440534d3 09-Jul-2018 Gao Feng <gfree.wind@vip.163.com>

netfilter: Remove useless param helper of nf_ct_helper_ext_add

The param helper of nf_ct_helper_ext_add is useless now, then remove
it now.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ed07d9a0 02-Jul-2018 Martynas Pumputis <martynas@weave.works>

netfilter: nf_conntrack: resolve clash for matching conntracks

This patch enables the clash resolution for NAT (disabled in
"590b52e10d41") if clashing conntracks match (i.e. both tuples are equal)
and a protocol allows it.

The clash might happen for a connections-less protocol (e.g. UDP) when
two threads in parallel writes to the same socket and consequent calls
to "get_unique_tuple" return the same tuples (incl. reply tuples).

In this case it is safe to perform the resolution, as the losing CT
describes the same mangling as the winning CT, so no modifications to
the packet are needed, and the result of rules traversal for the loser's
packet stays valid.

Signed-off-by: Martynas Pumputis <martynas@weave.works>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a0ae2562 28-Jun-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove l3proto abstraction

This unifies ipv4 and ipv6 protocol trackers and removes the l3proto
abstraction.

This gets rid of all l3proto indirect calls and the need to do
a lookup on the function to call for l3 demux.

It increases module size by only a small amount (12kbyte), so this reduces
size because nf_conntrack.ko is useless without either nf_conntrack_ipv4
or nf_conntrack_ipv6 module.

before:
text data bss dec hex filename
7357 1088 0 8445 20fd nf_conntrack_ipv4.ko
7405 1084 4 8493 212d nf_conntrack_ipv6.ko
72614 13689 236 86539 1520b nf_conntrack.ko
19K nf_conntrack_ipv4.ko
19K nf_conntrack_ipv6.ko
179K nf_conntrack.ko

after:
text data bss dec hex filename
79277 13937 236 93450 16d0a nf_conntrack.ko
191K nf_conntrack.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c779e849 28-Jun-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove get_timeout() indirection

Not needed, we can have the l4trackers fetch it themselvs.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 97e08cae 28-Jun-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: avoid l4proto pkt_to_tuple calls

Handle common protocols (udp, tcp, ..), in the core and only
do the call if needed by the l4proto tracker.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8b3892ea 28-Jun-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: avoid calls to l4proto invert_tuple

Handle the common cases (tcp, udp, etc). in the core and only
do the indirect call for the protocols that need it (GRE for instance).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6816d931 28-Jun-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove get_l4proto indirection from l3 protocol trackers

Handle it in the core instead.

ipv6_skip_exthdr() is built-in even if ipv6 is a module, i.e. this
doesn't create an ipv6 dependency.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d1b6fe94 28-Jun-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove invert_tuple indirection from l3 protocol trackers

Its simpler to just handle it directly in nf_ct_invert_tuple().
Also gets rid of need to pass l3proto pointer to resolve_conntrack().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 47a91b14 28-Jun-2018 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove pkt_to_tuple indirection from l3 protocol trackers

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 60e3be94 25-Jun-2018 Florian Westphal <fw@strlen.de>

openvswitch: use nf_ct_get_tuplepr, invert_tuplepr

These versions deal with the l3proto/l4proto details internally.
It removes only caller of nf_ct_get_tuple, so make it static.

After this, l3proto->get_l4proto() can be removed in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b60a6040 06-Jul-2018 Toke Høiland-Jørgensen <toke@toke.dk>

netfilter: Add nf_ct_get_tuple_skb global lookup function

This adds a global netfilter function to extract a conntrack tuple from an
skb. The function uses a new function added to nf_ct_hook, which will try
to get the tuple from skb->_nfct, and do a full lookup if that fails. This
makes it possible to use the lookup function before the skb has passed
through the conntrack init hooks (e.g., in an ingress qdisc). The tuple is
copied to the caller to avoid issues with reference counting.

The function returns false if conntrack is not loaded, allowing it to be
used without incurring a module dependency on conntrack. This is used by
the NAT mode in sch_cake.

Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2045cdfa 06-Jul-2018 Andrey Ryabinin <ryabinin.a.a@gmail.com>

netfilter: nf_conntrack: Fix possible possible crash on module loading.

Loading the nf_conntrack module with doubled hashsize parameter, i.e.
modprobe nf_conntrack hashsize=12345 hashsize=12345
causes NULL-ptr deref.

If 'hashsize' specified twice, the nf_conntrack_set_hashsize() function
will be called also twice.
The first nf_conntrack_set_hashsize() call will set the
'nf_conntrack_htable_size' variable:

nf_conntrack_set_hashsize()
...
/* On boot, we can set this without any fancy locking. */
if (!nf_conntrack_htable_size)
return param_set_uint(val, kp);

But on the second invocation, the nf_conntrack_htable_size is already set,
so the nf_conntrack_set_hashsize() will take a different path and call
the nf_conntrack_hash_resize() function. Which will crash on the attempt
to dereference 'nf_conntrack_hash' pointer:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack]
Call Trace:
nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack]
parse_args+0x1f9/0x5a0
load_module+0x1281/0x1a50
__se_sys_finit_module+0xbe/0xf0
do_syscall_64+0x7c/0x390
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fix this, by checking !nf_conntrack_hash instead of
!nf_conntrack_htable_size. nf_conntrack_hash will be initialized only
after the module loaded, so the second invocation of the
nf_conntrack_set_hashsize() won't crash, it will just reinitialize
nf_conntrack_htable_size again.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 368982cd 23-May-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_queue: resolve clash for unconfirmed conntracks

In nfqueue, two consecutive skbuffs may race to create the conntrack
entry. Hence, the one that loses the race gets dropped due to clash in
the insertion into the hashes from the nf_conntrack_confirm() path.

This patch adds a new nf_conntrack_update() function which searches for
possible clashes and resolve them. NAT mangling for the packet losing
race is corrected by using the conntrack information that won race.

In order to avoid direct module dependencies with conntrack and NAT, the
nf_ct_hook and nf_nat_hook structures are used for this purpose.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2c205dd3 23-May-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add struct nf_nat_hook and use it

Move decode_session() and parse_nat_setup_hook() indirections to struct
nf_nat_hook structure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1f4b2439 23-May-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add struct nf_ct_hook and use it

Move the nf_ct_destroy indirection to the struct nf_ct_hook.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 538c5672 06-May-2018 Florent Fourcot <florent.fourcot@wifirst.fr>

netfilter: ctnetlink: export nf_conntrack_max

IPCTNL_MSG_CT_GET_STATS netlink command allow to monitor current number
of conntrack entries. However, if one wants to compare it with the
maximum (and detect exhaustion), the only solution is currently to read
sysctl value.

This patch add nf_conntrack_max value in netlink message, and simplify
monitoring for application built on netlink API.

Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 152f2531 29-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Remove rtnl_lock() in nf_ct_iterate_destroy()

rtnl_lock() doesn't protect net::ct::count,
and it's not needed for__nf_ct_unconfirmed_destroy()
and for nf_queue_nf_hook_drop().

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f0b07bb1 29-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Introduce net_rwsem to protect net_namespace_list

rtnl_lock() is used everywhere, and contention is very high.
When someone wants to iterate over alive net namespaces,
he/she has no a possibility to do that without exclusive lock.
But the exclusive rtnl_lock() in such places is overkill,
and it just increases the contention. Yes, there is already
for_each_net_rcu() in kernel, but it requires rcu_read_lock(),
and this can't be sleepable. Also, sometimes it may be need
really prevent net_namespace_list growth, so for_each_net_rcu()
is not fit there.

This patch introduces new rw_semaphore, which will be used
instead of rtnl_mutex to protect net_namespace_list. It is
sleepable and allows not-exclusive iterations over net
namespaces list. It allows to stop using rtnl_lock()
in several places (what is made in next patches) and makes
less the time, we keep rtnl_mutex. Here we just add new lock,
while the explanation of we can remove rtnl_lock() there are
in next patches.

Fine grained locks generally are better, then one big lock,
so let's do that with net_namespace_list, while the situation
allows that.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e5531166 19-Jan-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: remove messages print and boot/module load time

Several reasons for this:

* Several modules maintain internal version numbers, that they print at
boot/module load time, that are not exposed to userspace, as a
primitive mechanism to make revision number control from the earlier
days of Netfilter.

* IPset shows the protocol version at boot/module load time, instead
display this via module description, as Jozsef suggested.

* Remove copyright notice at boot/module load time in two spots, the
Netfilter codebase is a collective development effort, if we would
have to display copyrights for each contributor at boot/module load
time for each extensions we have, we would probably fill up logs with
lots of useless information - from a technical standpoint.

So let's be consistent and remove them all.

Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 90964016 06-Jan-2018 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: add IPS_OFFLOAD status bit

This new bit tells us that the conntrack entry is owned by the flow
table offload infrastructure.

# cat /proc/net/nf_conntrack
ipv4 2 tcp 6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] mark=0 zone=0 use=2

Note the [OFFLOAD] tag in the listing.

The timer of such conntrack entries look like stopped from userspace.
In practise, to make sure the conntrack entry does not go away, the
conntrack timer is periodically set to an arbitrary large value that
gets refreshed on every iteration from the garbage collector, so it
never expires- and they display no internal state in the case of TCP
flows. This allows us to save a bitcheck from the packet path via
nf_ct_is_expired().

Conntrack entries that have been offloaded to the flow table
infrastructure cannot be deleted/flushed via ctnetlink. The flow table
infrastructure is also responsible for releasing this conntrack entry.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ffa53c58 24-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

netfilter: Eliminate cond_resched_rcu_qs() in favor of cond_resched()

Now that cond_resched() also provides RCU quiescent states when
needed, it can be used in place of cond_resched_rcu_qs(). This
commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <netfilter-devel@vger.kernel.org>


# 0984d427 02-Nov-2017 Vincent Guittot <vincent.guittot@linaro.org>

netfilter: conntrack: use power efficient workqueue

conntrack uses the bounded system_long_wq workqueue for its works that
don't have to run on the cpu they have been queued.
Using bounded workqueue prevents the scheduler to make smart decision about
the best place to schedule the work.

This patch replaces system_long_wq with system_power_efficient_wq. the work
stays bounded to a cpu by default unless the CONFIG_WQ_POWER_EFFICIENT is
enable. In the latter case, the work can be scheduled on the best cpu from
a power or a performance point of view.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5caaed15 02-Nov-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: don't cache nlattr_tuple_size result in nla_size

We currently call ->nlattr_tuple_size() once at register time and
cache result in l4proto->nla_size.

nla_size is the only member that is written to, avoiding this would
allow to make l4proto trackers const.

We can use ->nlattr_tuple_size() at run time, and cache result in
the individual trackers instead.

This is an intermediate step, next patch removes nlattr_size()
callback and computes size at compile time, then removes nla_size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e4dca7b7 17-Oct-2017 Kees Cook <keescook@chromium.org>

treewide: Fix function prototypes for module_param_call()

Several function prototypes for the set/get functions defined by
module_param_call() have a slightly wrong argument types. This fixes
those in an effort to clean up the calls when running under type-enforced
compiler instrumentation for CFI. This is the result of running the
following semantic patch:

@match_module_param_call_function@
declarer name module_param_call;
identifier _name, _set_func, _get_func;
expression _arg, _mode;
@@

module_param_call(_name, _set_func, _get_func, _arg, _mode);

@fix_set_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._set_func;
identifier _val, _param;
type _val_type, _param_type;
@@

int _set_func(
-_val_type _val
+const char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }

@fix_get_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._get_func;
identifier _val, _param;
type _val_type, _param_type;
@@

int _get_func(
-_val_type _val
+char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }

Two additional by-hand changes are included for places where the above
Coccinelle script didn't notice them:

drivers/platform/x86/thinkpad_acpi.c
fs/lockd/svc.c

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>


# eb6fad5a 11-Oct-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove pf argument from l4 packet functions

not needed/used anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 44d6e2f2 30-Aug-2017 Varsha Rao <rvarsha016@gmail.com>

net: Replace NF_CT_ASSERT() with WARN_ON().

This patch removes NF_CT_ASSERT() and instead uses WARN_ON().

Signed-off-by: Varsha Rao <rvarsha016@gmail.com>


# d1c1e39d 28-Aug-2017 Florian Westphal <fw@strlen.de>

netfilter: remove unused hooknum arg from packet functions

tested with allmodconfig build.

Signed-off-by: Florian Westphal <fw@strlen.de>


# b3480fe0 11-Aug-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: make protocol tracker pointers const

Doesn't change generated code, but will make it easier to eventually
make the actual trackers themselvers const.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2a04aabf 31-Jul-2017 Julia Lawall <julia.lawall@lip6.fr>

netfilter: constify nf_conntrack_l3/4proto parameters

When a nf_conntrack_l3/4proto parameter is not on the left hand side
of an assignment, its address is not taken, and it is not passed to a
function that may modify its fields, then it can be declared as const.

This change is useful from a documentation point of view, and can
possibly facilitate making some nf_conntrack_l3/4proto structures const
subsequently.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e2a75007 25-Jul-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: destroy functions need to free queued packets

queued skbs might be using conntrack extensions that are being removed,
such as timeout. This happens for skbs that have a skb->nfct in
unconfirmed state (i.e., not in hash table yet).

This is destructive, but there are only two use cases:
- module removal (rare)
- netns cleanup (most likely no conntracks exist, and if they do,
they are removed anyway later on).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 84657984 25-Jul-2017 Florian Westphal <fw@strlen.de>

netfilter: add and use nf_ct_unconfirmed_destroy

This also removes __nf_ct_unconfirmed_destroy() call from
nf_ct_iterate_cleanup_net, so that function can be used only
when missing conntracks from unconfirmed list isn't a problem.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a232cd0e 20-Jul-2017 Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>

netfilter: conntrack: Change to deferable work queue

Delayed workqueue causes wakeups to idle CPUs. This was
causing a power impact for devices. Use deferable work
queue instead so that gc_worker runs when CPU is active only.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3ef0c7a7 06-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()

As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().

V2: With improved comments, to clearly show how the barriers
pair.

Fixes: b16c29191dc8 ("netfilter: nf_conntrack: use safer way to lock all buckets")
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 0b35f603 18-Jul-2017 Taehee Yoo <ap420073@gmail.com>

netfilter: Remove duplicated rcu_read_lock.

This patch removes duplicate rcu_read_lock().

1. IPVS part:

According to Julian Anastasov's mention, contexts of ipvs are described
at: http://marc.info/?l=netfilter-devel&m=149562884514072&w=2, in summary:

- packet RX/TX: does not need locks because packets come from hooks.
- sync msg RX: backup server uses RCU locks while registering new
connections.
- ip_vs_ctl.c: configuration get/set, RCU locks needed.
- xt_ipvs.c: It is a netfilter match, running from hook context.

As result, rcu_read_lock and rcu_read_unlock can be removed from:

- ip_vs_core.c: all
- ip_vs_ctl.c:
- only from ip_vs_has_real_service
- ip_vs_ftp.c: all
- ip_vs_proto_sctp.c: all
- ip_vs_proto_tcp.c: all
- ip_vs_proto_udp.c: all
- ip_vs_xmit.c: all (contains only packet processing)

2. Netfilter part:

There are three types of functions that are guaranteed the rcu_read_lock().
First, as result, functions are only called by nf_hook():

- nf_conntrack_broadcast_help(), pptp_expectfn(), set_expected_rtp_rtcp().
- tcpmss_reverse_mtu(), tproxy_laddr4(), tproxy_laddr6().
- match_lookup_rt6(), check_hlist(), hashlimit_mt_common().
- xt_osf_match_packet().

Second, functions that caller already held the rcu_read_lock().
- destroy_conntrack(), ctnetlink_conntrack_event().
- ctnl_timeout_find_get(), nfqnl_nf_hook_drop().

Third, functions that are mixed with type1 and type2.

These functions are called by nf_hook() also these are called by
ordinary functions that already held the rcu_read_lock():

- __ctnetlink_glue_build(), ctnetlink_expect_event().
- ctnetlink_proto_size().

Applied files are below:

- nf_conntrack_broadcast.c, nf_conntrack_core.c, nf_conntrack_netlink.c.
- nf_conntrack_pptp.c, nf_conntrack_sip.c, nfnetlink_cttimeout.c.
- nfnetlink_queue.c, xt_TCPMSS.c, xt_TPROXY.c, xt_addrtype.c.
- xt_connlimit.c, xt_hashlimit.c, xt_osf.c

Detailed calltrace can be found at:
http://marc.info/?l=netfilter-devel&m=149667610710350&w=2

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7866cc57 30-May-2017 Florian Westphal <fw@strlen.de>

netns: add and use net_ns_barrier

Quoting Joe Stringer:
If a user loads nf_conntrack_ftp, sends FTP traffic through a network
namespace, destroys that namespace then unloads the FTP helper module,
then the kernel will crash.

Events that lead to the crash:
1. conntrack is created with ftp helper in netns x
2. This netns is destroyed
3. netns destruction is scheduled
4. netns destruction wq starts, removes netns from global list
5. ftp helper is unloaded, which resets all helpers of the conntracks
via for_each_net()

but because netns is already gone from list the for_each_net() loop
doesn't include it, therefore all of these conntracks are unaffected.

6. helper module unload finishes
7. netns wq invokes destructor for rmmod'ed helper

CC: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0d02d564 20-May-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: restart iteration on resize

We could some conntracks when a resize occurs in parallel.

Avoid this by sampling generation seqcnt and doing a restart if needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2843fb69 20-May-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: add nf_ct_iterate_destroy

sledgehammer to be used on module unload (to remove affected conntracks
from all namespaces).

It will also flag all unconfirmed conntracks as dying, i.e. they will
not be committed to main table.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b0feacaa 20-May-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: don't call iter for non-confirmed conntracks

nf_ct_iterate_cleanup_net currently calls iter() callback also for
conntracks on the unconfirmed list, but this is unsafe.

Acesses to nf_conn are fine, but some users access the extension area
in the iter() callback, but that does only work reliably for confirmed
conntracks (ct->ext can be reallocated at any time for unconfirmed
conntrack).

The seond issue is that there is a short window where a conntrack entry
is neither on the list nor in the table: To confirm an entry, it is first
removed from the unconfirmed list, then insert into the table.

Fix this by iterating the unconfirmed list first and marking all entries
as dying, then wait for rcu grace period.

This makes sure all entries that were about to be confirmed either are
in the main table, or will be dropped soon.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9fd6452d 20-May-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: rename nf_ct_iterate_cleanup

There are several places where we needlesly call nf_ct_iterate_cleanup,
we should instead iterate the full table at module unload time.

This is a leftover from back when the conntrack table got duplicated
per net namespace.

So rename nf_ct_iterate_cleanup to nf_ct_iterate_cleanup_net.
A later patch will then add a non-net variant.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ab71632c 03-May-2017 Geert Uytterhoeven <geert@linux-m68k.org>

netfilter: conntrack: Force inlining of build check to prevent build failure

If gcc (e.g. 4.1.2) decides not to inline total_extension_size(), the
build will fail with:

net/built-in.o: In function `nf_conntrack_init_start':
(.text+0x9baf6): undefined reference to `__compiletime_assert_1893'

or

ERROR: "__compiletime_assert_1893" [net/netfilter/nf_conntrack.ko] undefined!

Fix this by forcing inlining of total_extension_size().

Fixes: b3a5db109e0670d6 ("netfilter: conntrack: use u8 for extension sizes again")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c6dd940b 16-Apr-2017 Florian Westphal <fw@strlen.de>

netfilter: allow early drop of assured conntracks

If insertion of a new conntrack fails because the table is full, the kernel
searches the next buckets of the hash slot where the new connection
was supposed to be inserted at for an entry that hasn't seen traffic
in reply direction (non-assured), if it finds one, that entry is
is dropped and the new connection entry is allocated.

Allow the conntrack gc worker to also remove *assured* conntracks if
resources are low.

Do this by querying the l4 tracker, e.g. tcp connections are now dropped
if they are no longer established (e.g. in finwait).

This could be refined further, e.g. by adding 'soft' established timeout
(i.e., a timeout that is only used once we get close to resource
exhaustion).

Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b3a5db10 15-Apr-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: use u8 for extension sizes again

commit 223b02d923ecd7c84cf9780bb3686f455d279279
("netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len")
had to increase size of the extension offsets because total size of the
extensions had increased to a point where u8 did overflow.

3 years later we've managed to diet extensions a bit and we no longer
need u16. Furthermore we can now add a compile-time assertion for this
problem.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5f0d5a3a 18-Jan-2017 Paul E. McKenney <paulmck@kernel.org>

mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]
Acked-by: David Rientjes <rientjes@google.com>


# cc41c84b 14-Apr-2017 Florian Westphal <fw@strlen.de>

netfilter: kill the fake untracked conntrack objects

resurrect an old patch from Pablo Neira to remove the untracked objects.

Currently, there are four possible states of an skb wrt. conntrack.

1. No conntrack attached, ct is NULL.
2. Normal (kmem cache allocated) ct attached.
3. a template (kmalloc'd), not in any hash tables at any point in time
4. the 'untracked' conntrack, a percpu nf_conn object, tagged via
IPS_UNTRACKED_BIT in ct->status.

Untracked is supposed to be identical to case 1. It exists only
so users can check

-m conntrack --ctstate UNTRACKED vs.
-m conntrack --ctstate INVALID

e.g. attempts to set connmark on INVALID or UNTRACKED conntracks is
supposed to be a no-op.

Thus currently we need to check
ct == NULL || nf_ct_is_untracked(ct)

in a lot of places in order to avoid altering untracked objects.

The other consequence of the percpu untracked object is that all
-j NOTRACK (and, later, kfree_skb of such skbs) result in an atomic op
(inc/dec the untracked conntracks refcount).

This adds a new kernel-private ctinfo state, IP_CT_UNTRACKED, to
make the distinction instead.

The (few) places that care about packet invalid (ct is NULL) vs.
packet untracked now need to test ct == NULL vs. ctinfo == IP_CT_UNTRACKED,
but all other places can omit the nf_ct_is_untracked() check.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6e699867 28-Mar-2017 Florian Westphal <fw@strlen.de>

netfilter: nat: avoid use of nf_conn_nat extension

successful insert into the bysource hash sets IPS_SRC_NAT_DONE status bit
so we can check that instead of presence of nat extension which requires
extra deref.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# fc09e4a7 08-Mar-2017 Florian Westphal <fw@strlen.de>

netfilter: nf_conntrack: reduce resolve_normal_ct args

also mark init_conntrack noinline, in most cases resolve_normal_ct will
find an existing conntrack entry.

text data bss dec hex filename
16735 5707 176 22618 585a net/netfilter/nf_conntrack_core.o
16687 5707 176 22570 582a net/netfilter/nf_conntrack_core.o

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 170a1fb9 10-Mar-2017 Steven Rostedt (VMware) <rostedt@goodmis.org>

netfilter: Force fake conntrack entry to be at least 8 bytes aligned

Since the nfct and nfctinfo have been combined, the nf_conn structure
must be at least 8 bytes aligned, as the 3 LSB bits are used for the
nfctinfo. But there's a fake nf_conn structure to denote untracked
connections, which is created by a PER_CPU construct. This does not
guarantee that it will be 8 bytes aligned and can break the logic in
determining the correct nfctinfo.

I triggered this on a 32bit machine with the following error:

BUG: unable to handle kernel NULL pointer dereference at 00000af4
IP: nf_ct_deliver_cached_events+0x1b/0xfb
*pdpt = 0000000031962001 *pde = 0000000000000000

Oops: 0000 [#1] SMP
[Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport
OK ]
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
task: c126ec00 task.stack: c1258000
EIP: nf_ct_deliver_cached_events+0x1b/0xfb
EFLAGS: 00010202 CPU: 0
EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a
ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0
Call Trace:
<SOFTIRQ>
? ipv6_skip_exthdr+0xac/0xcb
ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6]
nf_hook_slow+0x22/0xc7
nf_hook+0x9a/0xad [ipv6]
? ip6t_do_table+0x356/0x379 [ip6_tables]
? ip6_fragment+0x9e9/0x9e9 [ipv6]
ip6_output+0xee/0x107 [ipv6]
? ip6_fragment+0x9e9/0x9e9 [ipv6]
dst_output+0x36/0x4d [ipv6]
NF_HOOK.constprop.37+0xb2/0xba [ipv6]
? icmp6_dst_alloc+0x2c/0xfd [ipv6]
? local_bh_enable+0x14/0x14 [ipv6]
mld_sendpack+0x1c5/0x281 [ipv6]
? mark_held_locks+0x40/0x5c
mld_ifc_timer_expire+0x1f6/0x21e [ipv6]
call_timer_fn+0x135/0x283
? detach_if_pending+0x55/0x55
? mld_dad_timer_expire+0x3e/0x3e [ipv6]
__run_timers+0x111/0x14b
? mld_dad_timer_expire+0x3e/0x3e [ipv6]
run_timer_softirq+0x1c/0x36
__do_softirq+0x185/0x37c
? test_ti_thread_flag.constprop.19+0xd/0xd
do_softirq_own_stack+0x22/0x28
</SOFTIRQ>
irq_exit+0x5a/0xa4
smp_apic_timer_interrupt+0x2a/0x34
apic_timer_interrupt+0x37/0x3c

By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte
alignment as all cache line sizes are at least 8 bytes or more.

Fixes: a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage area")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a9e419dc 23-Jan-2017 Florian Westphal <fw@strlen.de>

netfilter: merge ctinfo into nfct pointer storage area

After this change conntrack operations (lookup, creation, matching from
ruleset) only access one instead of two sk_buff cache lines.

This works for normal conntracks because those are allocated from a slab
that guarantees hw cacheline or 8byte alignment (whatever is larger)
so the 3 bits needed for ctinfo won't overlap with nf_conn addresses.

Template allocation now does manual address alignment (see previous change)
on arches that don't have sufficent kmalloc min alignment.

Some spots intentionally use skb->_nfct instead of skb_nfct() helpers,
this is to avoid undoing the skb_nfct() use when we remove untracked
conntrack object in the future.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 30322309 23-Jan-2017 Florian Westphal <fw@strlen.de>

netfilter: guarantee 8 byte minalign for template addresses

The next change will merge skb->nfct pointer and skb->nfctinfo
status bits into single skb->_nfct (unsigned long) area.

For this to work nf_conn addresses must always be aligned at least on
an 8 byte boundary since we will need the lower 3bits to store nfctinfo.

Conntrack templates are allocated via kmalloc.
kbuild test robot reported
BUILD_BUG_ON failed: NFCT_INFOMASK >= ARCH_KMALLOC_MINALIGN
on v1 of this patchset, so not all platforms meet this requirement.

Do manual alignment if needed, the alignment offset is stored in the
nf_conn entry protocol area. This works because templates are not
handed off to L4 protocol trackers.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c74454fa 23-Jan-2017 Florian Westphal <fw@strlen.de>

netfilter: add and use nf_ct_set helper

Add a helper to assign a nf_conn entry and the ctinfo bits to an sk_buff.
This avoids changing code in followup patch that merges skb->nfct and
skb->nfctinfo into skb->_nfct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cb9c6836 23-Jan-2017 Florian Westphal <fw@strlen.de>

skbuff: add and use skb_nfct helper

Followup patch renames skb->nfct and changes its type so add a helper to
avoid intrusive rename change later.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 97a6ad13 23-Jan-2017 Florian Westphal <fw@strlen.de>

netfilter: reduce direct skb->nfct usage

Next patch makes direct skb->nfct access illegal, reduce noise
in next patch by using accessors we already have.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 11df4b76 23-Jan-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: no need to pass ctinfo to error handler

It is never accessed for reading and the only places that write to it
are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo).

The conntrack core specifically checks for attached skb->nfct after
->error() invocation and returns early in this case.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e5072053 17-Jan-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: refine gc worker heuristics, redux

This further refines the changes made to conntrack gc_worker in
commit e0df8cae6c16 ("netfilter: conntrack: refine gc worker heuristics").

The main idea of that change was to reduce the scan interval when evictions
take place.

However, on the reporters' setup, there are 1-2 million conntrack entries
in total and roughly 8k new (and closing) connections per second.

In this case we'll always evict at least one entry per gc cycle and scan
interval is always at 1 jiffy because of this test:

} else if (expired_count) {
gc_work->next_gc_run /= 2U;
next_run = msecs_to_jiffies(1);

being true almost all the time.

Given we scan ~10k entries per run its clearly wrong to reduce interval
based on nonzero eviction count, it will only waste cpu cycles since a vast
majorities of conntracks are not timed out.

Thus only look at the ratio (scanned entries vs. evicted entries) to make
a decision on whether to reduce or not.

Because evictor is supposed to only kick in when system turns idle after
a busy period, pick a high ratio -- this makes it 50%. We thus keep
the idea of increasing scan rate when its likely that table contains many
expired entries.

In order to not let timed-out entries hang around for too long
(important when using event logging, in which case we want to timely
destroy events), we now scan the full table within at most
GC_MAX_SCAN_JIFFIES (16 seconds) even in worst-case scenario where all
timed-out entries sit in same slot.

I tested this with a vm under synflood (with
sysctl net.netfilter.nf_conntrack_tcp_timeout_syn_recv=3).

While flood is ongoing, interval now stays at its max rate
(GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV -> 125ms).

With feedback from Nicolas Dichtel.

Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Fixes: b87a2f9199ea82eaadc ("netfilter: conntrack: add gc worker to remove timed-out entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 524b698d 16-Jan-2017 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove GC_MAX_EVICTS break

Instead of breaking loop and instant resched, don't bother checking
this in first place (the loop calls cond_resched for every bucket anyway).

Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2456e855 25-Dec-2016 Thomas Gleixner <tglx@linutronix.de>

ktime: Get rid of the union

ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.

Get rid of the union and just keep ktime_t as simple typedef of type s64.

The conversion was done with coccinelle and some manual mopping up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>


# 56a62e22 08-Nov-2016 Arnd Bergmann <arnd@arndb.de>

netfilter: conntrack: fix NF_REPEAT handling

gcc correctly identified a theoretical uninitialized variable use:

net/netfilter/nf_conntrack_core.c: In function 'nf_conntrack_in':
net/netfilter/nf_conntrack_core.c:1125:14: error: 'l4proto' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This could only happen when we 'goto out' before looking up l4proto,
and then enter the retry, implying that l3proto->get_l4proto()
returned NF_REPEAT. This does not currently get returned in any
code path and probably won't ever happen, but is not good to
rely on.

Moving the repeat handling up a little should have the same
behavior as today but avoids the warning by making that case
impossible to enter.

[ I have mangled this original patch to remove the check for tmpl, we
should inconditionally jump back to the repeat label in case we hit
NF_REPEAT instead. I have also moved the comment that explains this
where it belongs. --pablo ]

Fixes: 08733a0cb7de ("netfilter: handle NF_REPEAT from nf_conntrack_in()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e0df8cae 04-Nov-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: refine gc worker heuristics

Nicolas Dichtel says:
After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
remove timed-out entries"), netlink conntrack deletion events may be
sent with a huge delay.

Nicolas further points at this line:

goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

and indeed, this isn't optimal at all. Rationale here was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge. But in order to have some guarantee
about maximum time period where a scan of the full conntrack table
completes we should always use a fixed slice size, so that once every
N scans the full table has been examined at least once.

We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).

So, after some discussion with Nicolas:

1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)

2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries

3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead

4. Want reasonable short time where we detect timed-out entry when
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.

The old gc interval is turned into a max number, scans can now happen
every jiffy if stale entries are present.

Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 08733a0c 03-Nov-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: handle NF_REPEAT from nf_conntrack_in()

NF_REPEAT is only needed from nf_conntrack_in() under a very specific
case required by the TCP protocol tracker, we can handle this case
without returning to the core hook path. Handling of NF_REPEAT from the
nf_reinject() is left untouched.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 7bb6615d 18-Oct-2016 Nicolas Dichtel <nicolas.dichtel@6wind.com>

netfilter: conntrack: restart gc immediately if GC_MAX_EVICTS is reached

When the maximum evictions number is reached, do not wait 5 seconds before
the next run.

CC: Florian Westphal <fw@strlen.de>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e2361cb9 21-Sep-2016 Aaron Conole <aconole@bytheb.org>

netfilter: Remove explicit rcu_read_lock in nf_hook_slow

All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
cleanup removes the recursive call. This is just a cleanup, as the locking
code gracefully handles this situation.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4440a2ab 12-Sep-2016 Gao Feng <fgao@ikuai8.com>

netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions

When memory is exhausted, nfct_seqadj_ext_add may fail to add the
synproxy and seqadj extensions. The function nf_ct_seqadj_init doesn't
check if get valid seqadj pointer by the nfct_seqadj.

Now drop the packet directly when fail to add seqadj extension to
avoid dereference NULL pointer in nf_ct_seqadj_init from
init_conntrack().

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8e8118f8 11-Sep-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove packet hotpath stats

These counters sit in hot path and do show up in perf, this is especially
true for 'found' and 'searched' which get incremented for every packet
processed.

Information like

searched=212030105
new=623431
found=333613
delete=623327

does not seem too helpful nowadays:

- on busy systems found and searched will overflow every few hours
(these are 32bit integers), other more busy ones every few days.

- for debugging there are better methods, such as iptables' trace target,
the conntrack log sysctls. Nowadays we also have perf tool.

This removes packet path stat counters except those that
are expected to be 0 (or close to 0) on a normal system, e.g.
'insert_failed' (race happened) or 'invalid' (proto tracker rejects).

The insert stat is retained for the ctnetlink case.
The found stat is retained for the tuple-is-taken check when NAT has to
determine if it needs to pick a different source address.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ad66713f 25-Aug-2016 Florian Westphal <fw@strlen.de>

netfilter: remove __nf_ct_kill_acct helper

After timer removal this just calls nf_ct_delete so remove the __ prefix
version and make nf_ct_kill a shorthand for nf_ct_delete.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c023c0e4 25-Aug-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: resched gc again if eviction rate is high

If we evicted a large fraction of the scanned conntrack entries re-schedule
the next gc cycle for immediate execution.

This triggers during tests where load is high, then drops to zero and
many connections will be in TW/CLOSE state with < 30 second timeouts.

Without this change it will take several minutes until conntrack count
comes back to normal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b87a2f91 25-Aug-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: add gc worker to remove timed-out entries

Conntrack gc worker to evict stale entries.

GC happens once every 5 seconds, but we only scan at most 1/64th of the
table (and not more than 8k) buckets to avoid hogging cpu.

This means that a complete scan of the table will take several minutes
of wall-clock time.

Considering that the gc run will never have to evict any entries
during normal operation because those will happen from packet path
this should be fine.

We only need gc to make sure userspace (conntrack event listeners)
eventually learn of the timeout, and for resource reclaim in case the
system becomes idle.

We do not disable BH and cond_resched for every bucket so this should
not introduce noticeable latencies either.

A followup patch will add a small change to speed up GC for the extreme
case where most entries are timed out on an otherwise idle system.

v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on
nulls value change in gc worker, suggested by Eric Dumazet.

v3: don't call cancel_delayed_work_sync twice (again, Eric).

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f330a7fd 25-Aug-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: get rid of conntrack timer

With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.

Eric also says: "Another reason was the fact that Thomas was about to
change max timer range [..]" (500462a9de657f8, 'timers: Switch to
a non-cascading wheel').

Remove the timer and use a 32bit jiffies value containing timestamp until
entry is valid.

During conntrack lookup, even before doing tuple comparision, check
the timeout value and evict the entry in case it is too old.

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.

Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict already-dead entry that
is being recycled.

This is the standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage colliction via work queue
and further places where we can reap obsoleted entries (e.g. during
netlink dumps), this is needed to avoid expired conntracks from hanging
around for too long when lookup rate is low after a busy period.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 95a8d19f 25-Aug-2016 Florian Westphal <fw@strlen.de>

netfilter: restart search if moved to other chain

In case nf_conntrack_tuple_taken did not find a conflicting entry
check that all entries in this hash slot were tested and restart
in case an entry was moved to another chain.

Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: ea781f197d6a ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2567c4ea 18-Aug-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: restore nf_conntrack_htable_size as exported symbol

This is required to iterate over the hash table in cttimeout, ctnetlink
and nf_conntrack_ipv4.

>> ERROR: "nf_conntrack_htable_size" [net/netfilter/nfnetlink_cttimeout.ko] undefined!
ERROR: "nf_conntrack_htable_size" [net/netfilter/nf_conntrack_netlink.ko] undefined!
ERROR: "nf_conntrack_htable_size" [net/ipv4/netfilter/nf_conntrack_ipv4.ko] undefined!

Fixes: adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 92e47ba8 13-Aug-2016 Liping Zhang <liping.zhang@spreadtrum.com>

netfilter: conntrack: simplify the code by using nf_conntrack_get_ht

Since commit 64b87639c9cb ("netfilter: conntrack: fix race between
nf_conntrack proc read and hash resize") introduce the
nf_conntrack_get_ht, so there's no need to check nf_conntrack_generation
again and again to get the hash table and hash size. And convert
nf_conntrack_get_ht to inline function here.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# adf05168 12-Aug-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: remove ip_conntrack* sysctl compat code

This backward compatibility has been around for more than ten years,
since Yasuyuki Kozakai introduced IPv6 in conntrack. These days, we have
alternate /proc/net/nf_conntrack* entries, the ctnetlink interface and
the conntrack utility got adopted by many people in the user community
according to what I observed on the netfilter user mailing list.

So let's get rid of this.

Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do
not need to be exported as symbol anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 590b52e1 11-Jul-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: skip clash resolution if nat is in place

The clash resolution is not easy to apply if the NAT table is
registered. Even if no NAT rules are installed, the nul-binding ensures
that a unique tuple is used, thus, the packet that loses race gets a
different source port number, as described by:

http://marc.info/?l=netfilter-devel&m=146818011604484&w=2

Clash resolution with NAT is also problematic if addresses/port range
ports are used since the conntrack that wins race may describe a
different mangling that we may have earlier applied to the packet via
nf_nat_setup_info().

Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>


# 3101e0fc 12-Jul-2016 Liping Zhang <liping.zhang@spreadtrum.com>

netfilter: conntrack: protect early_drop by rcu read lock

User can add ct entry via nfnetlink(IPCTNL_MSG_CT_NEW), and if the total
number reach the nf_conntrack_max, we will try to drop some ct entries.

But in this case(the main function call path is ctnetlink_create_conntrack
-> nf_conntrack_alloc -> early_drop), rcu_read_lock is not held, so race
with hash resize will happen.

Fixes: 242922a02717 ("netfilter: conntrack: simplify early_drop")
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 242922a0 03-Jul-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: simplify early_drop

We don't need to acquire the bucket lock during early drop, we can
use lockless traveral just like ____nf_conntrack_find.

The timer deletion serves as synchronization point, if another cpu
attempts to evict same entry, only one will succeed with timer deletion.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 64b87639 02-Jul-2016 Liping Zhang <liping.zhang@spreadtrum.com>

netfilter: conntrack: fix race between nf_conntrack proc read and hash resize

When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack
hash table via /sys/module/nf_conntrack/parameters/hashsize, race will
happen, because reader can observe a newly allocated hash but the old size
(or vice versa). So oops will happen like follows:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack]
Call Trace:
[<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
[<ffffffff81261a1c>] seq_read+0x2cc/0x390
[<ffffffff812a8d62>] proc_reg_read+0x42/0x70
[<ffffffff8123bee7>] __vfs_read+0x37/0x130
[<ffffffff81347980>] ? security_file_permission+0xa0/0xc0
[<ffffffff8123cf75>] vfs_read+0x95/0x140
[<ffffffff8123e475>] SyS_read+0x55/0xc0
[<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4

It is very easy to reproduce this kernel crash.
1. open one shell and input the following cmds:
while : ; do
echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
done
2. open more shells and input the following cmds:
while : ; do
cat /proc/net/nf_conntrack
done
3. just wait a monent, oops will happen soon.

The solution in this patch is based on Florian's Commit 5e3c61f98175
("netfilter: conntrack: fix lookup race during hash resize"). And
add a wrapper function nf_conntrack_get_ht to get hash and hsize
suggested by Florian Westphal.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 9cc1c73a 23-Apr-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: avoid integer overflow when resizing

Can overflow so we might allocate very small table when bucket count is
high on a 32bit platform.

Note: resize is only possible from init_netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3183ab89 22-Jun-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: allow increasing bucket size via sysctl too

No need to restrict this to module parameter.

We export a copy of the real hash size -- when user alters the value we
allocate the new table, copy entries etc before we update the real size
to the requested one.

This is also needed because the real size is used by concurrent readers
and cannot be changed without synchronizing the conntrack generation
seqcnt.

We only allow changing this value from the initial net namespace.

Tested using http-client-benchmark vs. httpterm with concurrent

while true;do
echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
done

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6c8dee98 11-Jun-2016 Florian Westphal <fw@strlen.de>

netfilter: move zone info into struct nf_conn

Curently we store zone information as a conntrack extension.
This has one drawback: for every lookup we need to fetch the zone data
from the extension area.

This change place the zone data directly into the main conntrack object
structure and then removes the zone conntrack extension.

The zone data is just 4 bytes, it fits into a padding hole before
the tuplehash info, so we do not even increase the nf_conn structure size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5a75cdeb 10-Jun-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: align nf_conn on cacheline boundary

increases struct size by 32 bytes (288 -> 320), but it is the right thing,
else any attempt to (re-)arrange nf_conn members by cacheline won't work.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 77571149 10-Jun-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: destroy kmemcache on module removal

I forgot to move the kmem_cache_destroy into the exit path.

Fixes: 0c5366b3a8c7 ("netfilter: conntrack: use single slab cache)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b316ff78 24-May-2016 Peter Zijlstra <peterz@infradead.org>

locking/spinlock, netfilter: Fix nf_conntrack_lock() barriers

Even with spin_unlock_wait() fixed, nf_conntrack_lock{,_all}() is
borken as it misses a bunch of memory barriers to order the whole
global vs local locks scheme.

Even x86 (and other TSO archs) are affected.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Updated the comments. ]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 31b0b385 14-May-2016 Linus Torvalds <torvalds@linux-foundation.org>

nf_conntrack: avoid kernel pointer value leak in slab name

The slab name ends up being visible in the directory structure under
/sys, and even if you don't have access rights to the file you can see
the filenames.

Just use a 64-bit counter instead of the pointer to the 'net' structure
to generate a unique name.

This code will go away in 4.7 when the conntrack code moves to a single
kmemcache, but this is the backportable simple solution to avoiding
leaking kernel pointers to user space.

Fixes: 5b3501faa874 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# c047c3b1 09-May-2016 Arnd Bergmann <arnd@arndb.de>

netfilter: conntrack: remove uninitialized shadow variable

A recent commit introduced an unconditional use of an uninitialized
variable, as reported in this gcc warning:

net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_confirm':
net/netfilter/nf_conntrack_core.c:632:33: error: 'ctinfo' may be used uninitialized in this function [-Werror=maybe-uninitialized]
bytes = atomic64_read(&counter[CTINFO2DIR(ctinfo)].bytes);
^
net/netfilter/nf_conntrack_core.c:628:26: note: 'ctinfo' was declared here
enum ip_conntrack_info ctinfo;

The problem is that a local variable shadows the function parameter.
This removes the local variable, which looks like what Pablo originally
intended.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race")
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c5366b3 09-May-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: use single slab cache

An earlier patch changed lookup side to also net_eq() namespaces after
obtaining a reference on the conntrack, so a single kmemcache can be used.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 71d8c47f 30-Apr-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: introduce clash resolution on insertion race

This patch introduces nf_ct_resolve_clash() to resolve race condition on
conntrack insertions.

This is particularly a problem for connection-less protocols such as
UDP, with no initial handshake. Two or more packets may race to insert
the entry resulting in packet drops.

Another problematic scenario are packets enqueued to userspace via
NFQUEUE after the raw table, that make it easier to trigger this
race.

To resolve this, the idea is to reset the conntrack entry to the one
that won race. Packet and bytes counters are also merged.

The 'insert_failed' stats still accounts for this situation, after
this patch, the drop counter is bumped whenever we drop packets, so we
can watch for unresolved clashes.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ba76738c 02-May-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: introduce nf_ct_acct_update()

Introduce a helper function to update conntrack counters.
__nf_ct_kill_acct() was unnecessarily subtracting skb_network_offset()
that is expected to be zero from the ipv4/ipv6 hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4b4ceb9d 30-Apr-2016 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: __nf_ct_l4proto_find() always returns valid pointer

Remove unnecessary check for non-nul pointer in destroy_conntrack()
given that __nf_ct_l4proto_find() returns the generic protocol tracker
if the protocol is not supported.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3e86638e 02-May-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: consider ct netns in early_drop logic

When iterating, skip conntrack entries living in a different netns.

We could ignore netns and kill some other non-assured one, but it
has two problems:

- a netns can kill non-assured conntracks in other namespace
- we would start to 'over-subscribe' the affected/overlimit netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 56d52d48 02-May-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: use a single hashtable for all namespaces

We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the table.

Assuming 64k bucket size, this change saves 0.5 mbyte per namespace on a
64bit system.

NAT bysrc and expectation hash is still per namespace, those will
changed too soon.

Future patch will also make conntrack object slab cache global again.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1b8c8a9f 02-May-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: make netns address part of hash

Once we place all conntracks into a global hash table we want them to be
spread across entire hash table, even if namespaces have overlapping ip
addresses.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e0c7d472 28-Apr-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: check netns when comparing conntrack objects

Once we place all conntracks in the same hash table we must also compare
the netns pointer to skip conntracks that belong to a different namespace.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 86804348 28-Apr-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: use nf_ct_key_equal() in more places

This prepares for upcoming change that places all conntracks into a
single, global table. For this to work we will need to also compare
net pointer during lookup. To avoid open-coding such check use the
nf_ct_key_equal helper and then later extend it to also consider net_eq.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 88b68bc5 28-Apr-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: don't attempt to iterate over empty table

Once we place all conntracks into same table iteration becomes more
costly because the table contains conntracks that we are not interested
in (belonging to other netns).

So don't bother scanning if the current namespace has no entries.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5e3c61f9 28-Apr-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: fix lookup race during hash resize

When resizing the conntrack hash table at runtime via
echo 42 > /sys/module/nf_conntrack/parameters/hashsize, we are racing with
the conntrack lookup path -- reads can happen in parallel and nothing
prevents readers from observing a the newly allocated hash but the old
size (or vice versa).

So access to hash[bucket] can trigger OOB read access in case the table got
expanded and we saw the new size but the old hash pointer (or it got shrunk
and we got new hash ptr but the size of the old and larger table):

kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN
CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107
[..]
Call Trace:
[<ffffffff822c3d6a>] ? nf_conntrack_tuple_taken+0x12a/0xe90
[<ffffffff822c3ac1>] ? nf_ct_invert_tuplepr+0x221/0x3a0
[<ffffffff8230e703>] get_unique_tuple+0xfb3/0x2760

Use generation counter to obtain the address/length of the same table.

Also add a synchronize_net before freeing the old hash.
AFAICS, without it we might access ct_hash[bucket] after ct_hash has been
freed, provided that lockless reader got delayed by another event:

CPU1 CPU2
seq_begin
seq_retry
<delay> resize occurs
free oldhash
for_each(oldhash[size])

Note that resize is only supported in init_netns, it took over 2 minutes
of constant resizing+flooding to produce the warning, so this isn't a
big problem in practice.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2cf12348 28-Apr-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: keep BH enabled during lookup

No need to disable BH here anymore:

stats are switched to _ATOMIC variant (== this_cpu_inc()), which
nowadays generates same code as the non _ATOMIC NF_STAT, at least on x86.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 70d72b7e 23-Apr-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: init all_locks to avoid debug warning

Else we get 'BUG: spinlock bad magic on CPU#' on resize when
spin lock debugging is enabled.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 141658fb 18-Apr-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: use get_random_once for conntrack hash seed

As earlier commit removed accessed to the hash from other files we can
also make it static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a3efd812 18-Apr-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: move generation seqcnt out of netns_ct

We only allow rehash in init namespace, so we only use
init_ns.generation. And even if we would allow it, it makes no sense
as the conntrack locks are global; any ongoing rehash prevents insert/
delete.

So make this private to nf_conntrack_core instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ccd63c20 15-Mar-2016 Weongyo Jeong <weongyo.linux@gmail.com>

netfilter: nf_conntrack: Uses pr_fmt() for logging.

Uses pr_fmt() macro for debugging messages of nf_conntrack module.

Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e39365be 13-Mar-2016 Nicholas Mc Guire <hofrat@osadl.org>

netfilter: nf_conntrack: consolidate lock/unlock into unlock_wait

The spin_lock()/spin_unlock() is synchronizing on the
nf_conntrack_locks_all_lock which is equivalent to
spin_unlock_wait() but the later should be more efficient.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d93c6258 20-Jan-2016 Florian Westphal <fw@strlen.de>

netfilter: conntrack: resched in nf_ct_iterate_cleanup

Ulrich reports soft lockup with following (shortened) callchain:

NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
__netif_receive_skb_core+0x6e4/0x774
process_backlog+0x94/0x160
net_rx_action+0x88/0x178
call_do_softirq+0x24/0x3c
do_softirq+0x54/0x6c
__local_bh_enable_ip+0x7c/0xbc
nf_ct_iterate_cleanup+0x11c/0x22c [nf_conntrack]
masq_inet_event+0x20/0x30 [nf_nat_masquerade_ipv6]
atomic_notifier_call_chain+0x1c/0x2c
ipv6_del_addr+0x1bc/0x220 [ipv6]

Problem is that nf_ct_iterate_cleanup can run for a very long time
since it can be interrupted by softirq processing.
Moreover, atomic_notifier_call_chain runs with rcu readlock held.

So lets call cond_resched() in nf_ct_iterate_cleanup and defer
the call to a work queue for the atomic_notifier_call_chain case.

We also need another cond_resched in get_next_corpse, since we
have to deal with iter() always returning false, in that case
get_next_corpse will walk entire conntrack table.

Reported-by: Ulrich Weber <uw@ocedo.com>
Tested-by: Ulrich Weber <uw@ocedo.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b16c2919 18-Jan-2016 Sasha Levin <sasha.levin@oracle.com>

netfilter: nf_conntrack: use safer way to lock all buckets

When we need to lock all buckets in the connection hashtable we'd attempt to
lock 1024 spinlocks, which is way more preemption levels than supported by
the kernel. Furthermore, this behavior was hidden by checking if lockdep is
enabled, and if it was - use only 8 buckets(!).

Fix this by using a global lock and synchronize all buckets on it when we
need to lock them all. This is pretty heavyweight, but is only done when we
need to resize the hashtable, and that doesn't happen often enough (or at all).

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ae2d708e 05-Oct-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: fix crash on timeout object removal

The object and module refcounts are updated for each conntrack template,
however, if we delete the iptables rules and we flush the timeout
database, we may end up with invalid references to timeout object that
are just gone.

Resolve this problem by setting the timeout reference to NULL when the
custom timeout entry is removed from our base. This patch requires some
RCU trickery to ensure safe pointer handling.

This handling is similar to what we already do with conntrack helpers,
the idea is to avoid bumping the timeout object reference counter from
the packet path to avoid the cost of atomic ops.

Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a31f1adc 18-Sep-2015 Eric W. Biederman <ebiederm@xmission.com>

netfilter: nf_conntrack: Add a struct net parameter to l4_pkt_to_tuple

As gre does not have the srckey in the packet gre_pkt_to_tuple
needs to perform a lookup in it's per network namespace tables.

Pass in the proper network namespace to all pkt_to_tuple
implementations to ensure gre (and any similar protocols) can get this
right.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 62da9865 02-Sep-2015 Daniel Borkmann <daniel@iogearbox.net>

netfilter: nf_conntrack: make nf_ct_zone_dflt built-in

Fengguang reported, that some randconfig generated the following linker
issue with nf_ct_zone_dflt object involved:

[...]
CC init/version.o
LD init/built-in.o
net/built-in.o: In function `ipv4_conntrack_defrag':
nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
net/built-in.o: In function `ipv6_defrag':
nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt'
make: *** [vmlinux] Error 1

Given that configurations exist where we have a built-in part, which is
accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user()
and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
area when netfilter is configured in general.

Therefore, split the more generic parts into a common header under
include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
section that already holds parts related to CONFIG_NF_CONNTRACK in the
netfilter core. This fixes the issue on my side.

Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9cf94eab 31-Aug-2015 Daniel Borkmann <daniel@iogearbox.net>

netfilter: conntrack: use nf_ct_tmpl_free in CT/synproxy error paths

Commit 0838aa7fcfcd ("netfilter: fix netns dependencies with conntrack
templates") migrated templates to the new allocator api, but forgot to
update error paths for them in CT and synproxy to use nf_ct_tmpl_free()
instead of nf_conntrack_free().

Due to that, memory is being freed into the wrong kmemcache, but also
we drop the per net reference count of ct objects causing an imbalance.

In Brad's case, this leads to a wrap-around of net->ct.count and thus
lets __nf_conntrack_alloc() refuse to create a new ct object:

[ 10.340913] xt_addrtype: ipv6 does not support BROADCAST matching
[ 10.810168] nf_conntrack: table full, dropping packet
[ 11.917416] r8169 0000:07:00.0 eth0: link up
[ 11.917438] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 12.815902] nf_conntrack: table full, dropping packet
[ 15.688561] nf_conntrack: table full, dropping packet
[ 15.689365] nf_conntrack: table full, dropping packet
[ 15.690169] nf_conntrack: table full, dropping packet
[ 15.690967] nf_conntrack: table full, dropping packet
[...]

With slab debugging, it also reports the wrong kmemcache (kmalloc-512 vs.
nf_conntrack_ffffffff81ce75c0) and reports poison overwrites, etc. Thus,
to fix the problem, export and use nf_ct_tmpl_free() instead.

Fixes: 0838aa7fcfcd ("netfilter: fix netns dependencies with conntrack templates")
Reported-by: Brad Jackson <bjackson0971@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5e8018fc 14-Aug-2015 Daniel Borkmann <daniel@iogearbox.net>

netfilter: nf_conntrack: add efficient mark to zone mapping

This work adds the possibility of deriving the zone id from the skb->mark
field in a scalable manner. This allows for having only a single template
serving hundreds/thousands of different zones, for example, instead of the
need to have one match for each zone as an extra CT jump target.

Note that we'd need to have this information attached to the template as at
the time when we're trying to lookup a possible ct object, we already need
to know zone information for a possible match when going into
__nf_conntrack_find_get(). This work provides a minimal implementation for
a possible mapping.

In order to not add/expose an extra ct->status bit, the zone structure has
been extended to carry a flag for deriving the mark.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# deedb590 14-Aug-2015 Daniel Borkmann <daniel@iogearbox.net>

netfilter: nf_conntrack: add direction support for zones

This work adds a direction parameter to netfilter zones, so identity
separation can be performed only in original/reply or both directions
(default). This basically opens up the possibility of doing NAT with
conflicting IP address/port tuples from multiple, isolated tenants
on a host (e.g. from a netns) without requiring each tenant to NAT
twice resp. to use its own dedicated IP address to SNAT to, meaning
overlapping tuples can be made unique with the zone identifier in
original direction, where the NAT engine will then allocate a unique
tuple in the commonly shared default zone for the reply direction.
In some restricted, local DNAT cases, also port redirection could be
used for making the reply traffic unique w/o requiring SNAT.

The consensus we've reached and discussed at NFWS and since the initial
implementation [1] was to directly integrate the direction meta data
into the existing zones infrastructure, as opposed to the ct->mark
approach we proposed initially.

As we pass the nf_conntrack_zone object directly around, we don't have
to touch all call-sites, but only those, that contain equality checks
of zones. Thus, based on the current direction (original or reply),
we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
CT expectations are direction-agnostic entities when expectations are
being compared among themselves, so we can only use the identifier
in this case.

Note that zone identifiers can not be included into the hash mix
anymore as they don't contain a "stable" value that would be equal
for both directions at all times, f.e. if only zone->id would
unconditionally be xor'ed into the table slot hash, then replies won't
find the corresponding conntracking entry anymore.

If no particular direction is specified when configuring zones, the
behaviour is exactly as we expect currently (both directions).

Support has been added for the CT netlink interface as well as the
x_tables raw CT target, which both already offer existing interfaces
to user space for the configuration of zones.

Below a minimal, simplified collision example (script in [2]) with
netperf sessions:

+--- tenant-1 ---+ mark := 1
| netperf |--+
+----------------+ | CT zone := mark [ORIGINAL]
[ip,sport] := X +--------------+ +--- gateway ---+
| mark routing |--| SNAT |-- ... +
+--------------+ +---------------+ |
+--- tenant-2 ---+ | ~~~|~~~
| netperf |--+ +-----------+ |
+----------------+ mark := 2 | netserver |------ ... +
[ip,sport] := X +-----------+
[ip,port] := Y
On the gateway netns, example:

iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully

iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark

conntrack dump from gateway netns:

netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns

tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
[ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
[ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1

tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
[ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
[ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2

Taking this further, test script in [2] creates 200 tenants and runs
original-tuple colliding netperf sessions each. A conntrack -L dump in
the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
state as expected.

I also did run various other tests with some permutations of the script,
to mention some: SNAT in random/random-fully/persistent mode, no zones (no
overlaps), static zones (original, reply, both directions), etc.

[1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
[2] https://paste.fedoraproject.org/242835/65657871/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 308ac914 08-Aug-2015 Daniel Borkmann <daniel@iogearbox.net>

netfilter: nf_conntrack: push zone object into functions

This patch replaces the zone id which is pushed down into functions
with the actual zone object. It's a bigger one-time change, but
needed for later on extending zones with a direction parameter, and
thus decoupling this additional information from all call-sites.

No functional changes in this patch.

The default zone becomes a global const object, namely nf_ct_zone_dflt
and will be returned directly in various cases, one being, when there's
f.e. no zoning support.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f58e5aa7 04-Aug-2015 Joe Stringer <joestringer@nicira.com>

netfilter: conntrack: Use flags in nf_ct_tmpl_alloc()

The flags were ignored for this function when it was introduced. Also
fix the style problem in kzalloc.

Fixes: 0838aa7fc (netfilter: fix netns dependencies with conntrack
templates)
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f0ad4621 23-Jul-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: silence warning on falling back to vmalloc()

Since 88eab472ec21 ("netfilter: conntrack: adjust nf_conntrack_buckets default
value"), the hashtable can easily hit this warning. We got reports from users
that are getting this message in a quite spamming fashion, so better silence
this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>


# 0838aa7f 13-Jul-2015 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: fix netns dependencies with conntrack templates

Quoting Daniel Borkmann:

"When adding connection tracking template rules to a netns, f.e. to
configure netfilter zones, the kernel will endlessly busy-loop as soon
as we try to delete the given netns in case there's at least one
template present, which is problematic i.e. if there is such bravery that
the priviledged user inside the netns is assumed untrusted.

Minimal example:

ip netns add foo
ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
ip netns del foo

What happens is that when nf_ct_iterate_cleanup() is being called from
nf_conntrack_cleanup_net_list() for a provided netns, we always end up
with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
don't get a soft-lockup as we still have a schedule() point, but the
serving CPU spins on 100% from that point onwards.

Since templates are normally allocated with nf_conntrack_alloc(), we
also bump net->ct.count. The issue why they are not yet nf_ct_put() is
because the per netns .exit() handler from x_tables (which would eventually
invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is
called in the dependency chain at a *later* point in time than the per
netns .exit() handler for the connection tracker.

This is clearly a chicken'n'egg problem: after the connection tracker
.exit() handler, we've teared down all the connection tracking
infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
invoked at a later point in time during the netns cleanup, as that would
lead to a use-after-free. At the same time, we cannot make x_tables depend
on the connection tracker module, so that the xt_ct_tg_destroy() would
be invoked earlier in the cleanup chain."

Daniel confirms this has to do with the order in which modules are loaded or
having compiled nf_conntrack as modules while x_tables built-in. So we have no
guarantees regarding the order in which netns callbacks are executed.

Fix this by allocating the templates through kmalloc() from the respective
SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch
is marked as unlikely since conntrack templates are rarely allocated and only
from the configuration plane path.

Note that templates are not kept in any list to avoid further dependencies with
nf_conntrack anymore, thus, the tmpl larval list is removed.

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>


# ae406bd0 24-Dec-2014 Kristian Evensen <kristian.evensen@gmail.com>

netfilter: conntrack: Remove nf_ct_conntrack_flush_report

The only user of nf_ct_conntrack_flush_report() was ctnetlink_del_conntrack().
After adding support for flushing connections with a given mark, this function
is no longer called.

Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8ca3f5e9 24-Nov-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: fix race between confirmation and flush

Commit 5195c14c8b27c ("netfilter: conntrack: fix race in
__nf_conntrack_confirm against get_next_corpse") aimed to resolve the
race condition between the confirmation (packet path) and the flush
command (from control plane). However, it introduced a crash when
several packets race to add a new conntrack, which seems easier to
reproduce when nf_queue is in place.

Fix this race, in __nf_conntrack_confirm(), by removing the CT
from unconfirmed list before checking the DYING bit. In case
race occured, re-add the CT to the dying list

This patch also changes the verdict from NF_ACCEPT to NF_DROP when
we lose race. Basically, the confirmation happens for the first packet
that we see in a flow. If you just invoked conntrack -F once (which
should be the common case), then this is likely to be the first packet
of the flow (unless you already called flush anytime soon in the past).
This should be hard to trigger, but better drop this packet, otherwise
we leave things in inconsistent state since the destination will likely
reply to this packet, but it will find no conntrack, unless the origin
retransmits.

The change of the verdict has been discussed in:
https://www.marc.info/?l=linux-netdev&m=141588039530056&w=2

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 88eab472 03-Dec-2014 Marcelo Leitner <mleitner@redhat.com>

netfilter: conntrack: adjust nf_conntrack_buckets default value

Manually bumping either nf_conntrack_buckets or nf_conntrack_max has
become a common task as our Linux servers tend to serve more and more
clients/applications, so let's adjust nf_conntrack_buckets this to a
more updated value.

Now for systems with more than 4GB of memory, nf_conntrack_buckets
becomes 65536 instead of 16384, resulting in nf_conntrack_max=256k
entries.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c41884ce 24-Nov-2014 Florian Westphal <fw@strlen.de>

netfilter: conntrack: avoid zeroing timer

add a __nfct_init_offset annotation member to struct nf_conn to make
it clear which members are covered by the memset when the conntrack
is allocated.

This avoids zeroing timer_list and ct_net; both are already inited
explicitly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 43612d7c 25-Nov-2014 Pablo Neira <pablo@netfilter.org>

Revert "netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpse"

This reverts commit 5195c14c8b27cc0b18220ddbf0e5ad3328a04187.

If the conntrack clashes with an existing one, it is left out of
the unconfirmed list, thus, crashing when dropping the packet and
releasing the conntrack since golden rule is that conntracks are
always placed in any of the existing lists for traceability reasons.

Reported-by: Daniel Borkmann <dborkman@redhat.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=88841
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5195c14c 06-Nov-2014 bill bonaparte <programme110@gmail.com>

netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpse

After removal of the central spinlock nf_conntrack_lock, in
commit 93bb0ceb75be2 ("netfilter: conntrack: remove central
spinlock nf_conntrack_lock"), it is possible to race against
get_next_corpse().

The race is against the get_next_corpse() cleanup on
the "unconfirmed" list (a per-cpu list with seperate locking),
which set the DYING bit.

Fix this race, in __nf_conntrack_confirm(), by removing the CT
from unconfirmed list before checking the DYING bit. In case
race occured, re-add the CT to the dying list.

While at this, fix coding style of the comment that has been
updated.

Fixes: 93bb0ceb75be2 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
Reported-by: bill bonaparte <programme110@gmail.com>
Signed-off-by: bill bonaparte <programme110@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8fc54f68 23-Aug-2014 Daniel Borkmann <daniel@iogearbox.net>

net: use reciprocal_scale() helper

Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d2de875c 22-Aug-2014 Eric Dumazet <edumazet@google.com>

net: use ktime_get_ns() and ktime_get_real_ns() helpers

ktime_get_ns() replaces ktime_to_ns(ktime_get())

ktime_get_real_ns() replaces ktime_to_ns(ktime_get_real())

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9500507c 10-Jun-2014 Florian Westphal <fw@strlen.de>

netfilter: conntrack: remove timer from ecache extension

This brings the (per-conntrack) ecache extension back to 24 bytes in size
(was 152 byte on x86_64 with lockdep on).

When event delivery fails, re-delivery is attempted via work queue.

Redelivery is attempted at least every 0.1 seconds, but can happen
more frequently if userspace is not congested.

The nf_ct_release_dying_list() function is removed.
With this patch, ownership of the to-be-redelivered conntracks
(on-dying-list-with-DYING-bit not yet set) is with the work queue,
which will release the references once event is out.

Joint work with Pablo Neira Ayuso.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 4e857c58 17-Mar-2014 Peter Zijlstra <peterz@infradead.org>

arch: Mass conversion of smp_mb__*()

Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ee214d54 11-Apr-2014 Andrey Vagin <avagin@openvz.org>

netfilter: nf_conntrack: initialize net.ct.generation

[ 251.920788] INFO: trying to register non-static key.
[ 251.921386] the code is fine but needs lockdep annotation.
[ 251.921386] turning off the locking correctness validator.
[ 251.921386] CPU: 2 PID: 15715 Comm: socket_listen Not tainted 3.14.0+ #294
[ 251.921386] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 251.921386] 0000000000000000 000000009d18c210 ffff880075f039b8 ffffffff816b7ecd
[ 251.921386] ffffffff822c3b10 ffff880075f039c8 ffffffff816b36f4 ffff880075f03aa0
[ 251.921386] ffffffff810c65ff ffffffff810c4a85 00000000fffffe01 ffffffffa0075172
[ 251.921386] Call Trace:
[ 251.921386] [<ffffffff816b7ecd>] dump_stack+0x45/0x56
[ 251.921386] [<ffffffff816b36f4>] register_lock_class.part.24+0x38/0x3c
[ 251.921386] [<ffffffff810c65ff>] __lock_acquire+0x168f/0x1b40
[ 251.921386] [<ffffffff810c4a85>] ? trace_hardirqs_on_caller+0x105/0x1d0
[ 251.921386] [<ffffffffa0075172>] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
[ 251.921386] [<ffffffff816c1215>] ? _raw_spin_unlock_bh+0x35/0x40
[ 251.921386] [<ffffffffa0075172>] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
[ 251.921386] [<ffffffff810c7272>] lock_acquire+0xa2/0x120
[ 251.921386] [<ffffffffa008ab90>] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
[ 251.921386] [<ffffffffa0055989>] __nf_conntrack_confirm+0x129/0x410 [nf_conntrack]
[ 251.921386] [<ffffffffa008ab90>] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
[ 251.921386] [<ffffffffa008ab90>] ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
[ 251.921386] [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
[ 251.921386] [<ffffffff815d8c5a>] nf_iterate+0xaa/0xc0
[ 251.921386] [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
[ 251.921386] [<ffffffff815d8d14>] nf_hook_slow+0xa4/0x190
[ 251.921386] [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
[ 251.921386] [<ffffffff815e98f2>] ip_output+0x92/0x100
[ 251.921386] [<ffffffff815e8df9>] ip_local_out+0x29/0x90
[ 251.921386] [<ffffffff815e9240>] ip_queue_xmit+0x170/0x4c0
[ 251.921386] [<ffffffff815e90d5>] ? ip_queue_xmit+0x5/0x4c0
[ 251.921386] [<ffffffff81601208>] tcp_transmit_skb+0x498/0x960
[ 251.921386] [<ffffffff81602d82>] tcp_connect+0x812/0x960
[ 251.921386] [<ffffffff810e3dc5>] ? ktime_get_real+0x25/0x70
[ 251.921386] [<ffffffff8159ea2a>] ? secure_tcp_sequence_number+0x6a/0xc0
[ 251.921386] [<ffffffff81606f57>] tcp_v4_connect+0x317/0x470
[ 251.921386] [<ffffffff8161f645>] __inet_stream_connect+0xb5/0x330
[ 251.921386] [<ffffffff8158dfc3>] ? lock_sock_nested+0x33/0xa0
[ 251.921386] [<ffffffff810c4b5d>] ? trace_hardirqs_on+0xd/0x10
[ 251.921386] [<ffffffff81078885>] ? __local_bh_enable_ip+0x75/0xe0
[ 251.921386] [<ffffffff8161f8f8>] inet_stream_connect+0x38/0x50
[ 251.921386] [<ffffffff8158b157>] SYSC_connect+0xe7/0x120
[ 251.921386] [<ffffffff810e3789>] ? current_kernel_time+0x69/0xd0
[ 251.921386] [<ffffffff810c4a85>] ? trace_hardirqs_on_caller+0x105/0x1d0
[ 251.921386] [<ffffffff810c4b5d>] ? trace_hardirqs_on+0xd/0x10
[ 251.921386] [<ffffffff8158c36e>] SyS_connect+0xe/0x10
[ 251.921386] [<ffffffff816caf69>] system_call_fastpath+0x16/0x1b
[ 312.014104] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=60003 jiffies, g=42359, c=42358, q=333)
[ 312.015097] INFO: Stall ended before state dump start

Fixes: 93bb0ceb75be ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d5d20912 17-Mar-2014 Eric Dumazet <edumazet@google.com>

netfilter: conntrack: Fix UP builds

ARRAY_SIZE(nf_conntrack_locks) is undefined if spinlock_t is an
empty structure. Replace it by CONNTRACK_LOCKS

Fixes: 93bb0ceb75be ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 93bb0ceb 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com>

netfilter: conntrack: remove central spinlock nf_conntrack_lock

nf_conntrack_lock is a monolithic lock and suffers from huge contention
on current generation servers (8 or more core/threads).

Perf locking congestion is clear on base kernel:

- 72.56% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock_bh
- _raw_spin_lock_bh
+ 25.33% init_conntrack
+ 24.86% nf_ct_delete_from_lists
+ 24.62% __nf_conntrack_confirm
+ 24.38% destroy_conntrack
+ 0.70% tcp_packet
+ 2.21% ksoftirqd/6 [kernel.kallsyms] [k] fib_table_lookup
+ 1.15% ksoftirqd/6 [kernel.kallsyms] [k] __slab_free
+ 0.77% ksoftirqd/6 [kernel.kallsyms] [k] inet_getpeer
+ 0.70% ksoftirqd/6 [nf_conntrack] [k] nf_ct_delete
+ 0.55% ksoftirqd/6 [ip_tables] [k] ipt_do_table

This patch change conntrack locking and provides a huge performance
improvement. SYN-flood attack tested on a 24-core E5-2695v2(ES) with
10Gbit/s ixgbe (with tool trafgen):

Base kernel: 810.405 new conntrack/sec
After patch: 2.233.876 new conntrack/sec

Notice other floods attack (SYN+ACK or ACK) can easily be deflected using:
# iptables -A INPUT -m state --state INVALID -j DROP
# sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

Use an array of hashed spinlocks to protect insertions/deletions of
conntracks into the hash table. 1024 spinlocks seem to give good
results, at minimal cost (4KB memory). Due to lockdep max depth,
1024 becomes 8 if CONFIG_LOCKDEP=y

The hash resize is a bit tricky, because we need to take all locks in
the array. A seqcount_t is used to synchronize the hash table users
with the resizing process.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ca7433df 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com>

netfilter: conntrack: seperate expect locking from nf_conntrack_lock

Netfilter expectations are protected with the same lock as conntrack
entries (nf_conntrack_lock). This patch split out expectations locking
to use it's own lock (nf_conntrack_expect_lock).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e1b207da 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com>

netfilter: avoid race with exp->master ct

Preparation for disconnecting the nf_conntrack_lock from the
expectations code. Once the nf_conntrack_lock is lifted, a race
condition is exposed.

The expectations master conntrack exp->master, can race with
delete operations, as the refcnt increment happens too late in
init_conntrack(). Race is against other CPUs invoking
->destroy() (destroy_conntrack()), or nf_ct_delete() (via timeout
or early_drop()).

Avoid this race in nf_ct_find_expectation() by using atomic_inc_not_zero(),
and checking if nf_ct_is_dying() (path via nf_ct_delete()).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b7779d06 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com>

netfilter: conntrack: spinlock per cpu to protect special lists.

One spinlock per cpu to protect dying/unconfirmed/template special lists.
(These lists are now per cpu, a bit like the untracked ct)
Add a @cpu field to nf_conn, to make sure we hold the appropriate
spinlock at removal time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b476b72a 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com>

netfilter: trivial code cleanup and doc changes

Changes while reading through the netfilter code.

Added hint about how conntrack nf_conn refcnt is accessed.
And renamed repl_hash to reply_hash for readability

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e53376be 03-Feb-2014 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt

With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the list, so we fulfill
Eric's golden rule which is that all released objects always have a
refcount that equals zero.

Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with non-zero ref-counter, because it can race with
nf_conntrack_find_get().

A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
ref-counter says that this conntrack is used. So when we release
a conntrack with non-zero counter, we break this assumption.

CPU1 CPU2
____nf_conntrack_find()
nf_ct_put()
destroy_conntrack()
...
init_conntrack
__nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
if (!l4proto->new(ct, skb, dataoff, timeouts))
nf_conntrack_free(ct); (use = 2 !!!)
...
__nf_conntrack_alloc (set use = 1)
if (!nf_ct_key_equal(h, tuple, zone))
nf_ct_put(ct); (use = 0)
destroy_conntrack()
/* continue to work with CT */

After applying the path "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():

<4>[67096.759334] ------------[ cut here ]------------
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C --------------- 2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>] [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255] [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255] [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255] [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255] [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255] [<ffffffff81484419>] nf_iterate+0x69/0xb0
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
<4>[67096.760255] [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
<4>[67096.760255] [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
<4>[67096.760255] [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
<4>[67096.760255] [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255] [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255] [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255] [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814457c9>] sys_sendto+0x139/0x190
<4>[67096.760255] [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255] [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255] [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5

I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Vagin <avagin@parallels.com>
Cc: Florian Westphal <fw@strlen.de>
Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>


# c6825c09 29-Jan-2014 Andrey Vagin <avagin@openvz.org>

netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Lets look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but in this moment a few readers
still can use the conntrack. Then this conntrack is released and another
thread creates conntrack with the same address and the equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested to check the confirm bit too. I think it's
right.

task 1 task 2 task 3
nf_conntrack_find_get
____nf_conntrack_find
destroy_conntrack
hlist_nulls_del_rcu
nf_conntrack_free
kmem_cache_free
__nf_conntrack_alloc
kmem_cache_alloc
memset(&ct->tuplehash[IP_CT_DIR_MAX],
if (nf_ct_is_dying(ct))
if (!nf_ct_tuple_equal()

I'm not sure, that I have ever seen this race condition in a real life.
Currently we are investigating a bug, which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently,
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>] [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622] [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697] [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770] [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843] [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919] [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991] [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063] [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133] [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207] [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277] [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346] [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419] [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491] [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562] [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638] [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712] [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785] [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858] [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936] [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006] [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081] [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151] [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229] [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303] [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378] [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454] [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531] [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# dcd93ed4 30-Dec-2013 stephen hemminger <stephen@networkplumber.org>

netfilter: nf_conntrack: remove dead code

The following code is not used in current upstream code.
Some of this seems to be old hooks, other might be used by some
out of tree module (which I don't care about breaking), and
the need_ipv4_conntrack was used by old NAT code but no longer
called.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 0c3c6c00 17-Nov-2013 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: decrement global counter after object release

nf_conntrack_free() decrements our counter (net->ct.count)
before releasing the conntrack object. That counter is used in the
nf_conntrack_cleanup_net_list path to check if it's time to
kmem_cache_destroy our cache of conntrack objects. I think we have
a race there that should be easier to trigger (although still hard)
with CONFIG_DEBUG_OBJECTS_FREE as object releases become slowier
according to the following splat:

[ 1136.321305] WARNING: CPU: 2 PID: 2483 at lib/debugobjects.c:260
debug_print_object+0x83/0xa0()
[ 1136.321311] ODEBUG: free active (active state 0) object type:
timer_list hint: delayed_work_timer_fn+0x0/0x20
...
[ 1136.321390] Call Trace:
[ 1136.321398] [<ffffffff8160d4a2>] dump_stack+0x45/0x56
[ 1136.321405] [<ffffffff810514e8>] warn_slowpath_common+0x78/0xa0
[ 1136.321410] [<ffffffff81051557>] warn_slowpath_fmt+0x47/0x50
[ 1136.321414] [<ffffffff812f8883>] debug_print_object+0x83/0xa0
[ 1136.321420] [<ffffffff8106aa90>] ? execute_in_process_context+0x90/0x90
[ 1136.321424] [<ffffffff812f99fb>] debug_check_no_obj_freed+0x20b/0x250
[ 1136.321429] [<ffffffff8112e7f2>] ? kmem_cache_destroy+0x92/0x100
[ 1136.321433] [<ffffffff8115d945>] kmem_cache_free+0x125/0x210
[ 1136.321436] [<ffffffff8112e7f2>] kmem_cache_destroy+0x92/0x100
[ 1136.321443] [<ffffffffa046b806>] nf_conntrack_cleanup_net_list+0x126/0x160 [nf_conntrack]
[ 1136.321449] [<ffffffffa046c43d>] nf_conntrack_pernet_exit+0x6d/0x80 [nf_conntrack]
[ 1136.321453] [<ffffffff81511cc3>] ops_exit_list.isra.3+0x53/0x60
[ 1136.321457] [<ffffffff815124f0>] cleanup_net+0x100/0x1b0
[ 1136.321460] [<ffffffff8106b31e>] process_one_work+0x18e/0x430
[ 1136.321463] [<ffffffff8106bf49>] worker_thread+0x119/0x390
[ 1136.321467] [<ffffffff8106be30>] ? manage_workers.isra.23+0x2a0/0x2a0
[ 1136.321470] [<ffffffff8107210b>] kthread+0xbb/0xc0
[ 1136.321472] [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
[ 1136.321477] [<ffffffff8161b8fc>] ret_from_fork+0x7c/0xb0
[ 1136.321479] [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
[ 1136.321481] ---[ end trace 25f53c192da70825 ]---

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f7b13e43 26-Sep-2013 Holger Eitzenberger <holger@eitzenberger.org>

netfilter: introduce nf_conn_acct structure

Encapsulate counters for both directions into nf_conn_acct. During
that process also consistently name pointers to the extend 'acct',
not 'counters'. This patch is a cleanup.

Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 48b1de4c 27-Aug-2013 Patrick McHardy <kaber@trash.net>

netfilter: add SYNPROXY core/target

Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy
core with common functions and an address family specific target.

The SYNPROXY receives the connection request from the client, responds with
a SYN/ACK containing a SYN cookie and announcing a zero window and checks
whether the final ACK from the client contains a valid cookie.

It then establishes a connection to the original destination and, if
successful, sends a window update to the client with the window size
announced by the server.

Support for timestamps, SACK, window scaling and MSS options can be
statically configured as target parameters if the features of the server
are known. If timestamps are used, the timestamp value sent back to
the client in the SYN/ACK will be different from the real timestamp of
the server. In order to now break PAWS, the timestamps are translated in
the direction server->client.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 41d73ec0 27-Aug-2013 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: make sequence number adjustments usuable without NAT

Split out sequence number adjustments from NAT and move them to the conntrack
core to make them usable for SYN proxying. The sequence number adjustment
information is moved to a seperate extend. The extend is added to new
conntracks when a NAT mapping is set up for a connection using a helper.

As a side effect, this saves 24 bytes per connection with NAT in the common
case that a connection does not have a helper assigned.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c655bc68 29-Jul-2013 Florian Westphal <fw@strlen.de>

netfilter: nf_conntrack: don't send destroy events from iterator

Let nf_ct_delete handle delivery of the DESTROY event.

Based on earlier patch from Pablo Neira.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2d89c68a 28-Jul-2013 Patrick McHardy <kaber@trash.net>

netfilter: nf_nat: change sequence number adjustments to 32 bits

Using 16 bits is too small, when many adjustments happen the offsets might
overflow and break the connection.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 02982c27 29-Jul-2013 Florian Westphal <fw@strlen.de>

netfilter: nf_conntrack: remove duplicate code in ctnetlink

ctnetlink contains copy-paste code from death_by_timeout. In order to
avoid changing both places in upcoming event delivery patch,
export death_by_timeout functionality and use it in the ctnetlink code.

Based on earlier patch from Pablo Neira.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 312a0c16 28-Jul-2013 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: constify sk_buff argument to nf_ct_attach()

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# ca3d41a5 29-Apr-2013 Akinobu Mita <akinobu.mita@gmail.com>

net/netfilter: rename random32() to prandom_u32()

Use preferable function name which implies using a pseudo-random
number generator.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ec464e5d 17-Apr-2013 Patrick McHardy <kaber@trash.net>

netfilter: rename netlink related "pid" variables to "portid"

Get rid of the confusing mix of pid and portid and use portid consistently
for all netlink related socket identities.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f229f6ce 06-Apr-2013 Patrick McHardy <kaber@trash.net>

netfilter: add my copyright statements

Add copyright statements to all netfilter files which have had significant
changes done by myself in the past.

Some notes:

- nf_conntrack_ecache.c was incorrectly attributed to Rusty and Netfilter
Core Team when it got split out of nf_conntrack_core.c. The copyrights
even state a date which lies six years before it was written. It was
written in 2005 by Harald and myself.

- net/ipv{4,6}/netfilter.c, net/netfitler/nf_queue.c were missing copyright
statements. I've added the copyright statement from net/netfilter/core.c,
where this code originated

- for nf_conntrack_proto_tcp.c I've also added Jozsef, since I didn't want
it to give the wrong impression

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# dece40e8 13-Mar-2013 Vladimir Davydov <vdavydov.dev@gmail.com>

netfilter: nf_conntrack: speed up module removal path if netns in use

The patch introduces nf_conntrack_cleanup_net_list(), which cleanups
nf_conntrack for a list of netns and calls synchronize_net() only once
for them all. This should reduce netns destruction time.

I've measured cleanup time for 1k dummy net ns. Here are the results:

<without the patch>
# modprobe nf_conntrack
# time modprobe -r nf_conntrack

real 0m10.337s
user 0m0.000s
sys 0m0.376s

<with the patch>
# modprobe nf_conntrack
# time modprobe -r nf_conntrack

real 0m5.661s
user 0m0.000s
sys 0m0.216s

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 49376368 16-Mar-2013 stephen hemminger <stephen@networkplumber.org>

netfilter: nf_conntrack: add include to fix sparse warning

Include header file to pickup prototype of nf_nat_seq_adjust_hook

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 04d87001 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_ct_proto: move initialization out of pernet_operations

Move the global initial codes to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5f69b8f5 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_ct_labels: move initialization out of pernet_operations

Move the global initial codes to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5e615b22 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_ct_helper: move initialization out of pernet_operations

Move the global initial codes to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 8684094c 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_ct_timeout: move initialization out of pernet_operations

Move the global initial codes to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 3fe0f943 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_ct_ecache: move initialization out of pernet_operations

Move the global initial codes to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 73f4001a 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_ct_tstamp: move initialization out of pernet_operations

Move the global initial codes to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b7ff3a1f 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_ct_acct: move initialization out of pernet_operations

Move the global initial codes to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 83b4dbe1 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_ct_expect: move initialization out of pernet_operations

Move the global initial codes to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# f94161c1 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_conntrack: move initialization out of pernet operations

nf_conntrack initialization and cleanup codes happens in pernet
operations function. This task should be done in module_init/exit.
We can't use init_net to identify if it's the right time to initialize
or cleanup since we cannot make assumption on the order netns are
created/destroyed.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c539f017 10-Jan-2013 Florian Westphal <fw@strlen.de>

netfilter: add connlabel conntrack extension

similar to connmarks, except labels are bit-based; i.e.
all labels may be attached to a flow at the same time.

Up to 128 labels are supported. Supporting more labels
is possible, but requires increasing the ct offset delta
from u8 to u16 type due to increased extension sizes.

Mapping of bit-identifier to label name is done in userspace.

The extension is enabled at run-time once "-m connlabel" netfilter
rules are added.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 1e47ee83 10-Jan-2013 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: fix BUG_ON while removing nf_conntrack with netns

canqun zhang reported that we're hitting BUG_ON in the
nf_conntrack_destroy path when calling kfree_skb while
rmmod'ing the nf_conntrack module.

Currently, the nf_ct_destroy hook is being set to NULL in the
destroy path of conntrack.init_net. However, this is a problem
since init_net may be destroyed before any other existing netns
(we cannot assume any specific ordering while releasing existing
netns according to what I read in recent emails).

Thanks to Gao feng for initial patch to address this issue.

Reported-by: canqun zhang <canqunzhang@gmail.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 252b3e8c 10-Dec-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: xt_CT: fix crash while destroy ct templates

In (d871bef netfilter: ctnetlink: dump entries from the dying and
unconfirmed lists), we assume that all conntrack objects are
inserted in any of the existing lists. However, template conntrack
objects were not. This results in hitting BUG_ON in the
destroy_conntrack path while removing a rule that uses the CT target.

This patch fixes the situation by adding the template lists, which
is where template conntrack objects reside now.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a71258d7 10-Dec-2012 Abhijit Pawar <abhi.c.pawar@gmail.com>

net: remove obsolete simple_strto<foo>

This patch removes the redundant occurences of simple_strto<foo>

Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b5511eb 09-Dec-2012 Abhijit Pawar <abhi.c.pawar@gmail.com>

net: remove obsolete simple_strto<foo>

This patch replace the obsolete simple_strto<foo> with kstrto<foo>

Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 04dac011 27-Nov-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: improve nf_conn object traceability

This patch modifies the conntrack subsystem so that all existing
allocated conntrack objects can be found in any of the following
places:

* the hash table, this is the typical place for alive conntrack objects.
* the unconfirmed list, this is the place for newly created conntrack objects
that are still traversing the stack.
* the dying list, this is where you can find conntrack objects that are dying
or that should die anytime soon (eg. once the destroy event is delivered to
the conntrackd daemon).

Thus, we make sure that we follow the track for all existing conntrack
objects. This patch, together with some extension of the ctnetlink interface
to dump the content of the dying and unconfirmed lists, will help in case
to debug suspected nf_conn object leaks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# b0cdb1d9 19-Sep-2012 Patrick McHardy <kaber@trash.net>

netfilter: nf_nat: fix oops when unloading protocol modules

When unloading a protocol module nf_ct_iterate_cleanup() is used to
remove all conntracks using the protocol from the bysource hash and
clean their NAT sections. Since the conntrack isn't actually killed,
the NAT callback is invoked twice, once for each direction, which
causes an oops when trying to delete it from the bysource hash for
the second time.

The same oops can also happen when removing both an L3 and L4 protocol
since the cleanup function doesn't check whether the conntrack has
already been cleaned up.

Pid: 4052, comm: modprobe Not tainted 3.6.0-rc3-test-nat-unload-fix+ #32 Red Hat KVM
RIP: 0010:[<ffffffffa002c303>] [<ffffffffa002c303>] nf_nat_proto_clean+0x73/0xd0 [nf_nat]
RSP: 0018:ffff88007808fe18 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800728550c0 RCX: ffff8800756288b0
RDX: dead000000200200 RSI: ffff88007808fe88 RDI: ffffffffa002f208
RBP: ffff88007808fe28 R08: ffff88007808e000 R09: 0000000000000000
R10: dead000000200200 R11: dead000000100100 R12: ffffffff81c6dc00
R13: ffff8800787582b8 R14: ffff880078758278 R15: ffff88007808fe88
FS: 00007f515985d700(0000) GS:ffff88007cd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f515986a000 CR3: 000000007867a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 4052, threadinfo ffff88007808e000, task ffff8800756288b0)
Stack:
ffff88007808fe68 ffffffffa002c290 ffff88007808fe78 ffffffff815614e3
ffffffff00000000 00000aeb00000246 ffff88007808fe68 ffffffff81c6dc00
ffff88007808fe88 ffffffffa00358a0 0000000000000000 000000000040f5b0
Call Trace:
[<ffffffffa002c290>] ? nf_nat_net_exit+0x50/0x50 [nf_nat]
[<ffffffff815614e3>] nf_ct_iterate_cleanup+0xc3/0x170
[<ffffffffa002c55a>] nf_nat_l3proto_unregister+0x8a/0x100 [nf_nat]
[<ffffffff812a0303>] ? compat_prepare_timeout+0x13/0xb0
[<ffffffffa0035848>] nf_nat_l3proto_ipv4_exit+0x10/0x23 [nf_nat_ipv4]
...

To fix this,

- check whether the conntrack has already been cleaned up in
nf_nat_proto_clean

- change nf_ct_iterate_cleanup() to only invoke the callback function
once for each conntrack (IP_CT_DIR_ORIGINAL).

The second change doesn't affect other callers since when conntracks are
actually killed, both directions are removed from the hash immediately
and the callback is already only invoked once. If it is not killed, the
second callback invocation will always return the same decision not to
kill it.

Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 84b5ee93 27-Aug-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: add nf_ct_timeout_lookup

This patch adds the new nf_ct_timeout_lookup function to encapsulate
the timeout policy attachment that is called in the nf_conntrack_in
path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5b423f6a 29-Aug-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: fix racy timer handling with reliable events

Existing code assumes that del_timer returns true for alive conntrack
entries. However, this is not true if reliable events are enabled.
In that case, del_timer may return true for entries that were
just inserted in the dying list. Note that packets / ctnetlink may
hold references to conntrack entries that were just inserted to such
list.

This patch fixes the issue by adding an independent timer for
event delivery. This increases the size of the ecache extension.
Still we can revisit this later and use variable size extensions
to allocate this area on demand.

Tested-by: Oliver Smith <olipro@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c7232c99 26-Aug-2012 Patrick McHardy <kaber@trash.net>

netfilter: add protocol independent NAT core

Convert the IPv4 NAT implementation to a protocol independent core and
address family specific modules.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 1afc5679 06-Jun-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_ct_helper: implement variable length helper private data

This patch uses the new variable length conntrack extensions.

Instead of using union nf_conntrack_help that contain all the
helper private data information, we allocate variable length
area to store the private helper data.

This patch includes the modification of all existing helpers.
It also includes a couple of include header to avoid compilation
warnings.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 15f585bd 28-May-2012 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_ct_generic: add namespace support

This patch adds namespace support for the generic layer 4 protocol
tracker.

Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# e3192690 03-Jun-2012 Joe Perches <joe@perches.com>

net: Remove casts to same type

Adding casts of objects to the same type is unnecessary
and confusing for a human reader.

For example, this cast:

int y;
int *p = (int *)&y;

I used the coccinelle script below to find and remove these
unnecessary casts. I manually removed the conversions this
script produces of casts with __force and __user.

@@
type T;
T *p;
@@

- (T *)p
+ p

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e87cc472 13-May-2012 Joe Perches <joe@perches.com>

net: Convert net_ratelimit uses to net_<level>_ratelimited

Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a9006892 18-Apr-2012 Eric Leblond <eric@regit.org>

netfilter: nf_ct_helper: allow to disable automatic helper assignment

This patch allows you to disable automatic conntrack helper
lookup based on TCP/UDP ports, eg.

echo 0 > /proc/sys/net/netfilter/nf_conntrack_helper

[ Note: flows that already got a helper will keep using it even
if automatic helper assignment has been disabled ]

Once this behaviour has been disabled, you have to explicitly
use the iptables CT target to attach helper to flows.

There are good reasons to stop supporting automatic helper
assignment, for further information, please read:

http://www.netfilter.org/news.html#2012-04-03

This patch also adds one message to inform that automatic helper
assignment is deprecated and it will be removed soon (this is
spotted only once, with the first flow that gets a helper attached
to make it as less annoying as possible).

Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 6ba90067 07-Apr-2012 Gao feng <gaofeng@cn.fujitsu.com>

netfilter: nf_conntrack: fix incorrect logic in nf_conntrack_init_net

in function nf_conntrack_init_net,when nf_conntrack_timeout_init falied,
we should call nf_conntrack_ecache_fini to do rollback.
but the current code calls nf_conntrack_timeout_fini.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# d96fc659 03-Apr-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: fix count leak in error path of __nf_conntrack_alloc

We have to decrement the conntrack counter if we fail to access the
zone extension.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bae65be8 01-Apr-2012 David S. Miller <davem@davemloft.net>

nf_conntrack_core: Stop using NLA_PUT*().

These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 60b5f8f7 22-Mar-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: permanently attach timeout policy to conntrack

We need to permanently attach the timeout policy to the conntrack,
otherwise we may apply the custom timeout policy inconsistently.

Without this patch, the following example:

nfct timeout add test inet icmp timeout 100
iptables -I PREROUTING -t raw -p icmp -s 1.1.1.1 -j CT --timeout test

Will only apply the custom timeout policy to outgoing packets from
1.1.1.1, but not to reply packets from 2.2.2.2 going to 1.1.1.1.

To fix this issue, this patch modifies the current logic to attach the
timeout policy when the first packet is seen (which is when the
conntrack entry is created). Then, we keep using the attached timeout
policy until the conntrack entry is destroyed.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 24de58f4 28-Feb-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: xt_CT: allow to attach timeout policy + glue code

This patch allows you to attach the timeout policy via the
CT target, it adds a new revision of the target to ensure
backward compatibility. Moreover, it also contains the glue
code to stick the timeout object defined via nfnetlink_cttimeout
to the given flow.

Example usage (it requires installing the nfct tool and
libnetfilter_cttimeout):

1) create the timeout policy:

nfct timeout add tcp-policy0 inet tcp \
established 1000 close 10 time_wait 10 last_ack 10

2) attach the timeout policy to the packet:

iptables -I PREROUTING -t raw -p tcp -j CT --timeout tcp-policy0

You have to install the following user-space software:

a) libnetfilter_cttimeout:
git://git.netfilter.org/libnetfilter_cttimeout

b) nfct:
git://git.netfilter.org/nfct

You also have to get iptables with -j CT --timeout support.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# dd705072 28-Feb-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_ct_ext: add timeout extension

This patch adds the timeout extension, which allows you to attach
specific timeout policies to flows.

This extension is only used by the template conntrack.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 2c8503f5 28-Feb-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: pass timeout array to l4->new and l4->packet

This patch defines a new interface for l4 protocol trackers:

unsigned int *(*get_timeouts)(struct net *net);

that is used to return the array of unsigned int that contains
the timeouts that will be applied for this flow. This is passed
to the l4proto->new(...) and l4proto->packet(...) functions to
specify the timeout policy.

This interface allows per-net global timeout configuration
(although only DCCP supports this by now) and it will allow
custom custom timeout configuration by means of follow-up
patches.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 74138511 05-Mar-2012 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: fix early_drop with reliable event delivery

If reliable event delivery is enabled and ctnetlink fails to deliver
the destroy event in early_drop, the conntrack subsystem cannot
drop any the candidate flow that was planned to be evicted.

Reported-by: Kerin Millar <kerframil@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7d367e06 24-Feb-2012 Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ctnetlink: fix soft lockup when netlink adds new entries (v2)

Marcell Zambo and Janos Farago noticed and reported that when
new conntrack entries are added via netlink and the conntrack table
gets full, soft lockup happens. This is because the nf_conntrack_lock
is held while nf_conntrack_alloc is called, which is in turn wants
to lock nf_conntrack_lock while evicting entries from the full table.

The patch fixes the soft lockup with limiting the holding of the
nf_conntrack_lock to the minimum, where it's absolutely required.
It required to extend (and thus change) nf_conntrack_hash_insert
so that it makes sure conntrack and ctnetlink do not add the same entry
twice to the conntrack table.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# cf778b00 11-Jan-2012 Eric Dumazet <eric.dumazet@gmail.com>

net: reintroduce missing rcu_assign_pointer() calls

commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).

We miss needed barriers, even on x86, when y is not NULL.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4d4e61c6 23-Dec-2011 Patrick McHardy <kaber@trash.net>

netfilter: nf_nat: use hash random for bysource hash

Use nf_conntrack_hash_rnd in NAT bysource hash to avoid hash chain attacks.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 966567b7 19-Dec-2011 Eric Dumazet <eric.dumazet@gmail.com>

net: two vzalloc() cleanups

We can use vzalloc() helper now instead of __vmalloc() trick

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b3e0bfa7 14-Dec-2011 Eric Dumazet <eric.dumazet@gmail.com>

netfilter: nf_conntrack: use atomic64 for accounting counters

We can use atomic64_t infrastructure to avoid taking a spinlock in fast
path, and remove inaccuracies while reading values in
ctnetlink_dump_counters() and connbytes_mt() on 32bit arches.

Suggested by Pablo.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# c0cd1156 11-Dec-2011 Igor Maravić <igorm@etf.rs>

net:netfilter: use IS_ENABLED

Use IS_ENABLED(CONFIG_FOO)
instead of defined(CONFIG_FOO) || defined (CONFIG_FOO_MODULE)

Signed-off-by: Igor Maravić <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0a9ee813 29-Aug-2011 Joe Perches <joe@perches.com>

netfilter: Remove unnecessary OOM logging messages

Site specific OOM messages are duplications of a generic MM
out of memory message and aren't really useful, so just
delete them.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# a9b3cd7f 01-Aug-2011 Stephen Hemminger <shemminger@vyatta.com>

rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER

When assigning a NULL value to an RCU protected pointer, no barrier
is needed. The rcu_assign_pointer, used to handle that but will soon
change to not handle the special case.

Convert all rcu_assign_pointer of NULL value.

//smpl
@@ expression P; @@

- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)

// </smpl>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 88ed01d1 02-Jun-2011 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: fix ct refcount leak in l4proto->error()

This patch fixes a refcount leak of ct objects that may occur if
l4proto->error() assigns one conntrack object to one skbuff. In
that case, we have to skip further processing in nf_conntrack_in().

With this patch, we can also fix wrong return values (-NF_ACCEPT)
for special cases in ICMP[v6] that should not bump the invalid/error
statistic counters.

Reported-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# fb048833 19-May-2011 Eric Dumazet <eric.dumazet@gmail.com>

netfilter: add more values to enum ip_conntrack_info

Following error is raised (and other similar ones) :

net/ipv4/netfilter/nf_nat_standalone.c: In function ‘nf_nat_fn’:
net/ipv4/netfilter/nf_nat_standalone.c:119:2: warning: case value ‘4’
not in enumerated type ‘enum ip_conntrack_info’

gcc barfs on adding two enum values and getting a not enumerated
result :

case IP_CT_RELATED+IP_CT_IS_REPLY:

Add missing enum values

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# fe8f661f 14-Mar-2011 Stephen Hemminger <shemminger@vyatta.com>

netfilter: nf_conntrack: fix sysctl memory leak

Message in log because sysctl table was not empty at netns exit
WARNING: at net/sysctl_net.c:84 sysctl_net_exit+0x2a/0x2c()

Instrumenting showed that the nf_conntrack_timestamp was the entry
that was being created but not cleared.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# c3174286 09-Feb-2011 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: set conntrack templates again if we return NF_REPEAT

The TCP tracking code has a special case that allows to return
NF_REPEAT if we receive a new SYN packet while in TIME_WAIT state.

In this situation, the TCP tracking code destroys the existing
conntrack to start a new clean session.

[DESTROY] tcp 6 src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925 [ASSURED]
[NEW] tcp 6 120 SYN_SENT src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 [UNREPLIED] src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925

However, this is a problem for the iptables' CT target event filtering
which will not work in this case since the conntrack template will not
be there for the new session. To fix this, we reassign the conntrack
template to the packet if we return NF_REPEAT.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# a992ca2a 19-Jan-2011 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack_tstamp: add flow-based timestamp extension

This patch adds flow-based timestamping for conntracks. This
conntrack extension is disabled by default. Basically, we use
two 64-bits variables to store the creation timestamp once the
conntrack has been confirmed and the other to store the deletion
time. This extension is disabled by default, to enable it, you
have to:

echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp

This patch allows to save memory for user-space flow-based
loogers such as ulogd2. In short, ulogd2 does not need to
keep a hashtable with the conntrack in user-space to know
when they were created and destroyed, instead we use the
kernel timestamp. If we want to have a sane IPFIX implementation
in user-space, this nanosecs resolution timestamps are also
useful. Other custom user-space applications can benefit from
this via libnetfilter_conntrack.

This patch modifies the /proc output to display the delta time
in seconds since the flow start. You can also obtain the
flow-start date by means of the conntrack-tools.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 45eec341 18-Jan-2011 Changli Gao <xiaosuo@gmail.com>

netfilter: nf_conntrack: remove an atomic bit operation

As this ct won't be seen by the others, we don't need to set the
IPS_CONFIRMED_BIT in atomic way.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Cc: Tim Gardner <tim.gardner@canonical.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# d862a662 14-Jan-2011 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: use is_vmalloc_addr()

Use is_vmalloc_addr() in nf_ct_free_hashtable() and get rid of
the vmalloc flags to indicate that a hash table has been allocated
using vmalloc().

Signed-off-by: Patrick McHardy <kaber@trash.net>


# f682cefa 04-Jan-2011 Changli Gao <xiaosuo@gmail.com>

netfilter: fix the race when initializing nf_ct_expect_hash_rnd

Since nf_ct_expect_dst_hash() may be called without nf_conntrack_lock
locked, nf_ct_expect_hash_rnd should be initialized in the atomic way.

In this patch, we use nf_conntrack_hash_rnd instead of
nf_ct_expect_hash_rnd.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e5fc9e7a 12-Nov-2010 Changli Gao <xiaosuo@gmail.com>

netfilter: nf_conntrack: don't always initialize ct->proto

ct->proto is big(60 bytes) due to structure ip_ct_tcp, and we don't need
to initialize the whole for all the other protocols. This patch moves
proto to the end of structure nf_conn, and pushes the initialization down
to the individual protocols.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 6b1686a7 27-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com>

netfilter: nf_conntrack: allow nf_ct_alloc_hashtable() to get highmem pages

commit ea781f197d6a8 (use SLAB_DESTROY_BY_RCU and get rid of call_rcu())
did a mistake in __vmalloc() call in nf_ct_alloc_hashtable().

I forgot to add __GFP_HIGHMEM, so pages were taken from LOWMEM only.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 99f07e91 21-Sep-2010 Changli Gao <xiaosuo@gmail.com>

netfilter: save the hash of the tuple in the original direction for latter use

Since we don't change the tuple in the original direction, we can save it
in ct->tuplehash[IP_CT_DIR_REPLY].hnode.pprev for __nf_conntrack_confirm()
use.

__hash_conntrack() is split into two steps: hash_conntrack_raw() is used
to get the raw hash, and __hash_bucket() is used to get the bucket id.

In SYN-flood case, early_drop() doesn't need to recompute the hash again.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# b2390969 16-Sep-2010 Changli Gao <xiaosuo@gmail.com>

netfilter: nf_conntrack: fix the hash random initializing race

nf_conntrack_alloc() isn't called with nf_conntrack_lock locked, so hash
random initializing code maybe executed more than once on different
CPUs.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 6661481d 02-Aug-2010 Changli Gao <xiaosuo@gmail.com>

netfilter: nf_conntrack_acct: use skb->len for accounting

use skb->len for accounting as xt_quota does. Since nf_conntrack works
at the network layer, skb_network_offset should always returns ZERO.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# b3c5163f 09-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com>

netfilter: nf_conntrack: per_cpu untracking

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.

This patch converts it to a per_cpu variable.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 5bfddbd4 08-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com>

netfilter: nf_conntrack: IPS_UNTRACKED bit

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet. This is bad for performance.
__read_mostly annotation is also a bad choice.

This patch introduces IPS_UNTRACKED bit so that we can use later a
per_cpu untrack structure more easily.

A new helper, nf_ct_untracked_get() returns a pointer to
nf_conntrack_untracked.

Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
IPS_NAT_DONE_MASK bits to untracked status.

nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# c2d9ba9b 01-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com>

net: CONFIG_NET_NS reduction

Use read_pnet() and write_pnet() to reduce number of ifdef CONFIG_NET_NS

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fc350777 20-May-2010 Joerg Marx <joerg.marx@secunet.com>

netfilter: nf_conntrack: fix a race in __nf_conntrack_confirm against nf_ct_get_next_corpse()

This race was triggered by a 'conntrack -F' command running in parallel
to the insertion of a hash for a new connection. Losing this race led to
a dead conntrack entry effectively blocking traffic for a particular
connection until timeout or flushing the conntrack hashes again.
Now the check for an already dying connection is done inside the lock.

Signed-off-by: Joerg Marx <joerg.marx@secunet.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 654d0fbd 13-May-2010 Stephen Hemminger <shemminger@vyatta.com>

netfilter: cleanup printk messages

Make sure all printk messages have a severity level.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# af740b2c 22-Apr-2010 Jesper Dangaard Brouer <hawk@comx.dk>

netfilter: nf_conntrack: extend with extra stat counter

I suspect an unfortunatly series of events occuring under a DDoS
attack, in function __nf_conntrack_find() nf_contrack_core.c.

Adding a stats counter to see if the search is restarted too often.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 5d0aa2cc 15-Feb-2010 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: add support for "conntrack zones"

Normally, each connection needs a unique identity. Conntrack zones allow
to specify a numerical zone using the CT target, connections in different
zones can use the same identity.

Example:

iptables -t raw -A PREROUTING -i veth0 -j CT --zone 1
iptables -t raw -A OUTPUT -o veth1 -j CT --zone 1

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 8fea97ec 15-Feb-2010 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: pass template to l4proto ->error() handler

The error handlers might need the template to get the conntrack zone
introduced in the next patches to perform a conntrack lookup.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# d696c7bd 08-Feb-2010 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: fix hash resizing with namespaces

As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash
size is global and not per namespace, but modifiable at runtime through
/sys/module/nf_conntrack/hashsize. Changing the hash size will only
resize the hash in the current namespace however, so other namespaces
will use an invalid hash size. This can cause crashes when enlarging
the hashsize, or false negative lookups when shrinking it.

Move the hash size into the per-namespace data and only use the global
hash size to initialize the per-namespace value when instanciating a
new namespace. Additionally restrict hash resizing to init_net for
now as other namespaces are not handled currently.

Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5b3501fa 08-Feb-2010 Eric Dumazet <eric.dumazet@gmail.com>

netfilter: nf_conntrack: per netns nf_conntrack_cachep

nf_conntrack_cachep is currently shared by all netns instances, but
because of SLAB_DESTROY_BY_RCU special semantics, this is wrong.

If we use a shared slab cache, one object can instantly flight between
one hash table (netns ONE) to another one (netns TWO), and concurrent
reader (doing a lookup in netns ONE, 'finding' an object of netns TWO)
can be fooled without notice, because no RCU grace period has to be
observed between object freeing and its reuse.

We dont have this problem with UDP/TCP slab caches because TCP/UDP
hashtables are global to the machine (and each object has a pointer to
its netns).

If we use per netns conntrack hash tables, we also *must* use per netns
conntrack slab caches, to guarantee an object can not escape from one
namespace to another one.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
[Patrick: added unique slab name allocation]
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 9edd7ca0 08-Feb-2010 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: fix memory corruption with multiple namespaces

As discovered by Jon Masters <jonathan@jonmasters.org>, the "untracked"
conntrack, which is located in the data section, might be accidentally
freed when a new namespace is instantiated while the untracked conntrack
is attached to a skb because the reference count it re-initialized.

The best fix would be to use a seperate untracked conntrack per
namespace since it includes a namespace pointer. Unfortunately this is
not possible without larger changes since the namespace is not easily
available everywhere we need it. For now move the untracked conntrack
initialization to the init_net setup function to make sure the reference
count is not re-initialized and handle cleanup in the init_net cleanup
function to make sure namespaces can exit properly while the untracked
conntrack is in use in other namespaces.

Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9ab48ddc 08-Feb-2010 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: fix hash resizing with namespaces

As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash
size is global and not per namespace, but modifiable at runtime through
/sys/module/nf_conntrack/hashsize. Changing the hash size will only
resize the hash in the current namespace however, so other namespaces
will use an invalid hash size. This can cause crashes when enlarging
the hashsize, or false negative lookups when shrinking it.

Move the hash size into the per-namespace data and only use the global
hash size to initialize the per-namespace value when instanciating a
new namespace. Additionally restrict hash resizing to init_net for
now as other namespaces are not handled currently.

Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>


# ab59b19b 04-Feb-2010 Eric Dumazet <eric.dumazet@gmail.com>

netfilter: nf_conntrack: per netns nf_conntrack_cachep

nf_conntrack_cachep is currently shared by all netns instances, but
because of SLAB_DESTROY_BY_RCU special semantics, this is wrong.

If we use a shared slab cache, one object can instantly flight between
one hash table (netns ONE) to another one (netns TWO), and concurrent
reader (doing a lookup in netns ONE, 'finding' an object of netns TWO)
can be fooled without notice, because no RCU grace period has to be
observed between object freeing and its reuse.

We dont have this problem with UDP/TCP slab caches because TCP/UDP
hashtables are global to the machine (and each object has a pointer to
its netns).

If we use per netns conntrack hash tables, we also *must* use per netns
conntrack slab caches, to guarantee an object can not escape from one
namespace to another one.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
[Patrick: added unique slab name allocation]
Signed-off-by: Patrick McHardy <kaber@trash.net>


# b2a15a60 03-Feb-2010 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: support conntrack templates

Support initializing selected parameters of new conntrack entries from a
"conntrack template", which is a specially marked conntrack entry attached
to the skb.

Currently the helper and the event delivery masks can be initialized this
way.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 0cebe4b4 03-Feb-2010 Patrick McHardy <kaber@trash.net>

netfilter: ctnetlink: support selective event delivery

Add two masks for conntrack end expectation events to struct nf_conntrack_ecache
and use them to filter events. Their default value is "all events" when the
event sysctl is on and "no events" when it is off. A following patch will add
specific initializations. Expectation events depend on the ecache struct of
their master conntrack.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 858b3133 03-Feb-2010 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: split up IPCT_STATUS event

Split up the IPCT_STATUS event into an IPCT_REPLY event, which is generated
when the IPS_SEEN_REPLY bit is set, and an IPCT_ASSURED event, which is
generated when the IPS_ASSURED bit is set.

In combination with a following patch to support selective event delivery,
this can be used for "sparse" conntrack replication: start replicating the
conntrack entry after it reached the ASSURED state and that way it's SYN-flood
resistant.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 056ff3e3 02-Feb-2010 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: fix memory corruption with multiple namespaces

As discovered by Jon Masters <jonathan@jonmasters.org>, the "untracked"
conntrack, which is located in the data section, might be accidentally
freed when a new namespace is instantiated while the untracked conntrack
is attached to a skb because the reference count it re-initialized.

The best fix would be to use a seperate untracked conntrack per
namespace since it includes a namespace pointer. Unfortunately this is
not possible without larger changes since the namespace is not easily
available everywhere we need it. For now move the untracked conntrack
initialization to the init_net setup function to make sure the reference
count is not re-initialized and handle cleanup in the init_net cleanup
function to make sure namespaces can exit properly while the untracked
conntrack is in use in other namespaces.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# f9dd09c7 06-Nov-2009 Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: nf_nat: fix NAT issue in 2.6.30.4+

Vitezslav Samel discovered that since 2.6.30.4+ active FTP can not work
over NAT. The "cause" of the problem was a fix of unacknowledged data
detection with NAT (commit a3a9f79e361e864f0e9d75ebe2a0cb43d17c4272).
However, actually, that fix uncovered a long standing bug in TCP conntrack:
when NAT was enabled, we simply updated the max of the right edge of
the segments we have seen (td_end), by the offset NAT produced with
changing IP/port in the data. However, we did not update the other parameter
(td_maxend) which is affected by the NAT offset. Thus that could drift
away from the correct value and thus resulted breaking active FTP.

The patch below fixes the issue by *not* updating the conntrack parameters
from NAT, but instead taking into account the NAT offsets in conntrack in a
consistent way. (Updating from NAT would be more harder and expensive because
it'd need to re-calculate parameters we already calculated in conntrack.)

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5ae27aa2 05-Nov-2009 Changli Gao <xiaosuo@gmail.com>

netfilter: nf_conntrack: avoid additional compare.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# d43c36dc 07-Oct-2009 Alexey Dobriyan <adobriyan@gmail.com>

headers: remove sched.h from interrupt.h

After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>


# 4481374c 21-Sep-2009 Jan Beulich <JBeulich@novell.com>

mm: replace various uses of num_physpages by totalram_pages

Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages. The amount of what actually is usable as storage
should instead be used as a basis here.

Some of the calculations (i.e. those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ee254fa4 31-Aug-2009 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: nf_conntrack: netns fix re reliable conntrack event delivery

Conntracks in netns other than init_net dying list were never killed.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 39938324 25-Aug-2009 Patrick McHardy <kaber@trash.net>

netfilter: nfnetlink: constify message attributes and headers

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 941297f4 16-Jul-2009 Eric Dumazet <eric.dumazet@gmail.com>

netfilter: nf_conntrack: nf_conntrack_alloc() fixes

When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when allocating
objects, since slab allocator could give a freed object still used by lockless
readers.

In particular, nf_conntrack RCU lookups rely on ct->tuplehash[xxx].hnnode.next
being always valid (ie containing a valid 'nulls' value, or a valid pointer to next
object in hash chain.)

kmem_cache_zalloc() setups object with NULL values, but a NULL value is not valid
for ct->tuplehash[xxx].hnnode.next.

Fix is to call kmem_cache_alloc() and do the zeroing ourself.

As spotted by Patrick, we also need to make sure lookup keys are committed to
memory before setting refcount to 1, or a lockless reader could get a reference
on the old version of the object. Its key re-check could then pass the barrier.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 8d8890b7 22-Jun-2009 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: fix conntrack lookup race

The RCU protected conntrack hash lookup only checks whether the entry
has a refcount of zero to decide whether it is stale. This is not
sufficient, entries are explicitly removed while there is at least
one reference left, possibly more. Explicitly check whether the entry
has been marked as dying to fix this.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 5c8ec910 22-Jun-2009 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: fix confirmation race condition

New connection tracking entries are inserted into the hash before they
are fully set up, namely the CONFIRMED bit is not set and the timer not
started yet. This can theoretically lead to a race with timer, which
would set the timeout value to a relative value, most likely already in
the past.

Perform hash insertion as the final step to fix this.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 8cc20198 22-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

netfilter: nf_conntrack: death_by_timeout() fix

death_by_timeout() might delete a conntrack from hash list
and insert it in dying list.

nf_ct_delete_from_lists(ct);
nf_ct_insert_dying_list(ct);

I believe a (lockless) reader could *catch* ct while doing a lookup
and miss the end of its chain.
(nulls lookup algo must check the null value at the end of lookup and
should restart if the null value is not the expected one.
cf Documentation/RCU/rculist_nulls.txt for details)

We need to change nf_conntrack_init_net() and use a different "null" value,
guaranteed not being used in regular lists. Choose very large values, since
hash table uses [0..size-1] null values.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# dd7669a9 12-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: optional reliable conntrack event delivery

This patch improves ctnetlink event reliability if one broadcast
listener has set the NETLINK_BROADCAST_ERROR socket option.

The logic is the following: if an event delivery fails, we keep
the undelivered events in the missed event cache. Once the next
packet arrives, we add the new events (if any) to the missed
events in the cache and we try a new delivery, and so on. Thus,
if ctnetlink fails to deliver an event, we try to deliver them
once we see a new packet. Therefore, we may lose state
transitions but the userspace process gets in sync at some point.

At worst case, if no events were delivered to userspace, we make
sure that destroy events are successfully delivered. Basically,
if ctnetlink fails to deliver the destroy event, we remove the
conntrack entry from the hashes and we insert them in the dying
list, which contains inactive entries. Then, the conntrack timer
is added with an extra grace timeout of random32() % 15 seconds
to trigger the event again (this grace timeout is tunable via
/proc). The use of a limited random timeout value allows
distributing the "destroy" resends, thus, avoiding accumulating
lots "destroy" events at the same time. Event delivery may
re-order but we can identify them by means of the tuple plus
the conntrack ID.

The maximum number of conntrack entries (active or inactive) is
still handled by nf_conntrack_max. Thus, we may start dropping
packets at some point if we accumulate a lot of inactive conntrack
entries that did not successfully report the destroy event to
userspace.

During my stress tests consisting of setting a very small buffer
of 2048 bytes for conntrackd and the NETLINK_BROADCAST_ERROR socket
flag, and generating lots of very small connections, I noticed
very few destroy entries on the fly waiting to be resend.

A simple way to test this patch consist of creating a lot of
entries, set a very small Netlink buffer in conntrackd (+ a patch
which is not in the git tree to set the BROADCAST_ERROR flag)
and invoke `conntrack -F'.

For expectations, no changes are introduced in this patch.
Currently, event delivery is only done for new expectations (no
events from expectation expiration, removal and confirmation).
In that case, they need a per-expectation event cache to implement
the same idea that is exposed in this patch.

This patch can be useful to provide reliable flow-accouting. We
still have to add a new conntrack extension to store the creation
and destroy time.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 9858a3ae 12-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: move helper destruction to nf_ct_helper_destroy()

This patch moves the helper destruction to a function that lives
in nf_conntrack_helper.c. This new function is used in the patch
to add ctnetlink reliable event delivery.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# a0891aa6 12-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: move event caching to conntrack extension infrastructure

This patch reworks the per-cpu event caching to use the conntrack
extension infrastructure.

The main drawback is that we consume more memory per conntrack
if event delivery is enabled. This patch is required by the
reliable event delivery that follows to this patch.

BTW, this patch allows you to enable/disable event delivery via
/proc/sys/net/netfilter/nf_conntrack_events in runtime, although
you can still disable event caching as compilation option.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 65cb9fda 12-Jun-2009 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: use mod_timer_pending() for conntrack refresh

Use mod_timer_pending() instead of atomic sequence of del_timer()/
add_timer(). mod_timer_pending() does not rearm an inactive timer,
so we don't need the conntrack lock anymore to make sure we don't
accidentally rearm a timer of a conntrack which is in the process
of being destroyed.

With this change, we don't need to take the global lock anymore at all,
counter updates can be performed under the per-conntrack lock.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 440f0d58 10-Jun-2009 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: use per-conntrack locks for protocol data

Introduce per-conntrack locks and use them instead of the global protocol
locks to avoid contention. Especially tcp_lock shows up very high in
profiles on larger machines.

This will also allow to simplify the upcoming reliable event delivery patches.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 17e6e4ea 02-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: simplify event caching system

This patch simplifies the conntrack event caching system by removing
several events:

* IPCT_[*]_VOLATILE, IPCT_HELPINFO and IPCT_NATINFO has been deleted
since the have no clients.
* IPCT_COUNTER_FILLING which is a leftover of the 32-bits counter
days.
* IPCT_REFRESH which is not of any use since we always include the
timeout in the messages.

After this patch, the existing events are:

* IPCT_NEW, IPCT_RELATED and IPCT_DESTROY, that are used to identify
addition and deletion of entries.
* IPCT_STATUS, that notes that the status bits have changes,
eg. IPS_SEEN_REPLY and IPS_ASSURED.
* IPCT_PROTOINFO, that reports that internal protocol information has
changed, eg. the TCP, DCCP and SCTP protocol state.
* IPCT_HELPER, that a helper has been assigned or unassigned to this
entry.
* IPCT_MARK and IPCT_SECMARK, that reports that the mark has changed, this
covers the case when a mark is set to zero.
* IPCT_NATSEQADJ, to report that there's updates in the NAT sequence
adjustment.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 274d383b 02-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: don't report events on module removal

During the module removal there are no possible event listeners
since ctnetlink must be removed before to allow removing
nf_conntrack. This patch removes the event reporting for the
module removal case which is not of any use in the existing code.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>


# 5c0de29d 25-Mar-2009 Holger Eitzenberger <holger@eitzenberger.org>

netfilter: nf_conntrack: add generic function to get len of generic policy

Usefull for all protocols which do not add additional data, such
as GRE or UDPlite.

Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# ea781f19 25-Mar-2009 Eric Dumazet <dada1@cosmosbay.com>

netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()

Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP.

This permits an easy conversion from call_rcu() based hash lists to a
SLAB_DESTROY_BY_RCU one.

Avoiding call_rcu() delay at nf_conn freeing time has numerous gains.

First, it doesnt fill RCU queues (up to 10000 elements per cpu).
This reduces OOM possibility, if queued elements are not taken into account
This reduces latency problems when RCU queue size hits hilimit and triggers
emergency mode.

- It allows fast reuse of just freed elements, permitting better use of
CPU cache.

- We delete rcu_head from "struct nf_conn", shrinking size of this structure
by 8 or 16 bytes.

This patch only takes care of "struct nf_conn".
call_rcu() is still used for less critical conntrack parts, that may
be converted later if necessary.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 78f36486 25-Mar-2009 Eric Dumazet <dada1@cosmosbay.com>

netfilter: nf_conntrack: use hlist_add_head_rcu() in nf_conntrack_set_hashsize()

Using hlist_add_head() in nf_conntrack_set_hashsize() is quite dangerous.
Without any barrier, one CPU could see a loop while doing its lookup.
Its true new table cannot be seen by another cpu, but previous table is still
readable.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 1d45209d 24-Mar-2009 Eric Dumazet <dada1@cosmosbay.com>

netfilter: nf_conntrack: Reduce conntrack count in nf_conntrack_free()

We use RCU to defer freeing of conntrack structures. In DOS situation, RCU might
accumulate about 10.000 elements per CPU in its internal queues. To get accurate
conntrack counts (at the expense of slightly more RAM used), we might consider
conntrack counter not taking into account "about to be freed elements, waiting
in RCU queues". We thus decrement it in nf_conntrack_free(), not in the RCU
callback.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Tested-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# ec8d5409 16-Mar-2009 Christoph Paasch <christoph.paasch@gmail.com>

netfilter: conntrack: fix dropping packet after l4proto->packet()

We currently use the negative value in the conntrack code to encode
the packet verdict in the error. As NF_DROP is equal to 0, inverting
NF_DROP makes no sense and, as a result, no packets are ever dropped.

Signed-off-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 7d1e0459 24-Feb-2009 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: account packets drop by tcp_packet()

Since tcp_packet() may return -NF_DROP in two situations, the
packet-drop stats must be increased.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# af07d241 20-Feb-2009 Hagen Paul Pfeifer <hagen@jauu.net>

netfilter: fix hardcoded size assumptions

get_random_bytes() is sometimes called with a hard coded size assumption
of an integer. This could not be true for next centuries. This patch
replace it with a compile time statement.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# e478075c 20-Feb-2009 Hagen Paul Pfeifer <hagen@jauu.net>

netfilter: nf_conntrack: table max size should hold at least table size

Table size is defined as unsigned, wheres the table maximum size is
defined as a signed integer. The calculation of max is 8 or 4,
multiplied the table size. Therefore the max value is aligned to
unsigned.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# cd7fcbf1 11-Jan-2009 Julia Lawall <julia@diku.dk>

netfilter 07/09: simplify nf_conntrack_alloc() error handling

nf_conntrack_alloc cannot return NULL, so there is no need to check for
NULL before using the value. I have also removed the initialization of ct
to NULL in nf_conntrack_alloc, since the value is never used, and since
perhaps it might lead one to think that return ct at the end might return
NULL.

The semantic patch that finds this problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@match exists@
expression x, E;
position p1,p2;
statement S1, S2;
@@

x@p1 = nf_conntrack_alloc(...)
... when != x = E
(
if (x@p2 == NULL || ...) S1 else S2
|
if (x@p2 == NULL && ...) S1 else S2
)

@other_match exists@
expression match.x, E1, E2;
position p1!=match.p1,match.p2;
@@

x@p1 = E1
... when != x = E2
x@p2

@ script:python depends on !other_match@
p1 << match.p1;
p2 << match.p2;
@@

print "%s: call to nf_conntrack_alloc %s bad test %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b54ad409 24-Nov-2008 Patrick McHardy <kaber@trash.net>

netfilter: ctnetlink: fix conntrack creation race

Conntrack creation through ctnetlink has two races:

- the timer may expire and free the conntrack concurrently, causing an
invalid memory access when attempting to put it in the hash tables

- an identical conntrack entry may be created in the packet processing
path in the time between the lookup and hash insertion

Hold the conntrack lock between the lookup and insertion to avoid this.

Reported-by: Zoltan Borbely <bozo@andrews.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e17b666a 17-Nov-2008 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: fix warning and prototype mismatch

net/netfilter/nf_conntrack_core.c:46:1: warning: symbol 'nfnetlink_parse_nat_setup_hook' was not declared. Should it be static?

Including the proper header also revealed an incorrect prototype.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 19abb7b0 18-Nov-2008 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ctnetlink: deliver events for conntracks changed from userspace

As for now, the creation and update of conntracks via ctnetlink do not
propagate an event to userspace. This can result in inconsistent situations
if several userspace processes modify the connection tracking table by means
of ctnetlink at the same time. Specifically, using the conntrack command
line tool and conntrackd at the same time can trigger unconsistencies.

This patch also modifies the event cache infrastructure to pass the
process PID and the ECHO flag to nfnetlink_send() to report back
to userspace if the process that triggered the change needs so.
Based on a suggestion from Patrick McHardy.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 226c0c0e 18-Nov-2008 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ctnetlink: helper modules load-on-demand support

This patch adds module loading for helpers via ctnetlink.

* Creation path: We support explicit and implicit helper assignation. For
the explicit case, we try to load the module. If the module is correctly
loaded and the helper is present, we return EAGAIN to re-start the
creation. Otherwise, we return EOPNOTSUPP.
* Update path: release the spin lock, load the module and check. If it is
present, then return EAGAIN to re-start the update.

This patch provides a refactorized function to lookup-and-set the
connection tracking helper. The function removes the exported symbol
__nf_ct_helper_find as it has not clients anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# e6a7d3c0 14-Oct-2008 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ctnetlink: remove bogus module dependency between ctnetlink and nf_nat

This patch removes the module dependency between ctnetlink and
nf_nat by means of an indirect call that is initialized when
nf_nat is loaded. Now, nf_conntrack_netlink only requires
nf_conntrack and nfnetlink.

This patch puts nfnetlink_parse_nat_setup_hook into the
nf_conntrack_core to avoid dependencies between ctnetlink,
nf_conntrack_ipv4 and nf_conntrack_ipv6.

This patch also introduces the function ctnetlink_change_nat
that is only invoked from the creation path. Actually, the
nat handling cannot be invoked from the update path since
this is not allowed. By introducing this function, we remove
the useless nat handling in the update path and we avoid
deadlock-prone code.

This patch also adds the required EAGAIN logic for nfnetlink.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 08f6547d 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: final netns tweaks

Add init_net checks to not remove kmem_caches twice and so on.

Refactor functions to split code which should be executed only for
init_net into one place.

ip_ct_attach and ip_ct_destroy assignments remain separate, because
they're separate stages in setup and teardown.

NOTE: NOTRACK code is in for-every-net part. It will be made per-netns
after we decidce how to do it correctly.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# d716a4df 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: per-netns conntrack accounting

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# c2a2c7e0 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: per-netns net.netfilter.nf_conntrack_log_invalid sysctl

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 0d55af87 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: per-netns statistics

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 6058fa6b 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: per-netns event cache

Heh, last minute proof-reading of this patch made me think,
that this is actually unneeded, simply because "ct" pointers will be
different for different conntracks in different netns, just like they
are different in one netns.

Not so sure anymore.

[Patrick: pointers will be different, flushing can only be done while
inactive though and thus it needs to be per netns]

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# a71996fc 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: pass conntrack to nf_conntrack_event_cache() not skb

This is cleaner, we already know conntrack to which event is relevant.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 74c51a14 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: pass netns pointer to L4 protocol's ->error hook

Again, it's deducible from skb, but we're going to use it for
nf_conntrack_checksum and statistics, so just pass it from upper layer.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# a702a65f 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: pass netns pointer to nf_conntrack_in()

It's deducible from skb->dev or skb->dst->dev, but we know netns at
the moment of call, so pass it down and use for finding and creating
conntracks.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 63c9a262 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: per-netns unconfirmed list

What is confirmed connection in one netns can very well be unconfirmed
in another one.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 9b03f38d 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: per-netns expectations

Make per-netns a) expectation hash and b) expectations count.

Expectations always belongs to netns to which it's master conntrack belong.
This is natural and doesn't bloat expectation.

Proc files and leaf users are stubbed to init_net, this is temporary.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 400dad39 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: per-netns conntrack hash

* make per-netns conntrack hash

Other solution is to add ->ct_net pointer to tuplehashes and still has one
hash, I tried that it's ugly and requires more code deep down in protocol
modules et al.

* propagate netns pointer to where needed, e. g. to conntrack iterators.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 49ac8713 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: per-netns conntrack count

Sysctls and proc files are stubbed to init_net's one. This is temporary.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 5a1fb391 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: add ->ct_net -- pointer from conntrack to netns

Conntrack (struct nf_conn) gets pointer to netns: ->ct_net -- netns in which
it was created. It comes from netdevice.

->ct_net is write-once field.

Every conntrack in system has ->ct_net initialized, no exceptions.

->ct_net doesn't pin netns: conntracks are recycled after timeouts and
pinning background traffic will prevent netns from even starting shutdown
sequence.

Right now every conntrack is created in init_net.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# dfdb8d79 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

netfilter: netns nf_conntrack: add netns boilerplate

One comment: #ifdefs around #include is necessary to overcome amazing compile
breakages in NOTRACK-in-netns patch (see below).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 76108cea 08-Oct-2008 Jan Engelhardt <jengelh@medozas.de>

netfilter: Use unsigned types for hooknum and pf vars

and (try to) consistently use u_int8_t for the L3 family.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 9714be7d 06-Aug-2008 Krzysztof Piotr Oledzki <ole@ans.pl>

netfilter: fix two recent sysctl problems

Starting with 9043476f726802f4b00c96d0c4f418dde48d1304 ("[PATCH]
sanitize proc_sysctl") we have two netfilter releated problems:

- WARNING: at kernel/sysctl.c:1966 unregister_sysctl_table+0xcc/0x103(),
caused by wrong order of ini/fini calls

- net.netfilter is duplicated and has truncated set of records

Thanks to very useful guidelines from Al Viro, this patch fixes both
of them.

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 58401572 21-Jul-2008 Krzysztof Piotr Oledzki <ole@ans.pl>

netfilter: accounting rework: ct_extend + 64bit counters (v4)

Initially netfilter has had 64bit counters for conntrack-based accounting, but
it was changed in 2.6.14 to save memory. Unfortunately in-kernel 64bit counters are
still required, for example for "connbytes" extension. However, 64bit counters
waste a lot of memory and it was not possible to enable/disable it runtime.

This patch:
- reimplements accounting with respect to the extension infrastructure,
- makes one global version of seq_print_acct() instead of two seq_print_counters(),
- makes it possible to enable it at boot time (for CONFIG_SYSCTL/CONFIG_SYSFS=n),
- makes it possible to enable/disable it at runtime by sysctl or sysfs,
- extends counters from 32bit to 64bit,
- renames ip_conntrack_counter -> nf_conn_counter,
- enables accounting code unconditionally (no longer depends on CONFIG_NF_CT_ACCT),
- set initial accounting enable state based on CONFIG_NF_CT_ACCT
- removes buggy IPCT_COUNTER_FILLING event handling.

If accounting is enabled newly created connections get additional acct extend.
Old connections are not changed as it is not possible to add a ct_extend area
to confirmed conntrack. Accounting is performed for all connections with
acct extend regardless of a current state of "net.netfilter.nf_conntrack_acct".

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4c889498 14-Jul-2008 David S. Miller <davem@davemloft.net>

netfilter: Let nf_ct_kill() callers know if del_timer() returned true.

Signed-off-by: David S. Miller <davem@davemloft.net>


# b891c5a8 08-Jul-2008 Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: add allocation flag to nf_conntrack_alloc

ctnetlink does not need to allocate the conntrack entries with GFP_ATOMIC
as its code is executed in user context.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ceeff754 11-Jun-2008 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: fix ctnetlink related crash in nf_nat_setup_info()

When creation of a new conntrack entry in ctnetlink fails after having
set up the NAT mappings, the conntrack has an extension area allocated
that is not getting properly destroyed when freeing the conntrack again.
This means the NAT extension is still in the bysource hash, causing a
crash when walking over the hash chain the next time:

BUG: unable to handle kernel paging request at 00120fbd
IP: [<c03d394b>] nf_nat_setup_info+0x221/0x58a
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP

Pid: 2795, comm: conntrackd Not tainted (2.6.26-rc5 #1)
EIP: 0060:[<c03d394b>] EFLAGS: 00010206 CPU: 1
EIP is at nf_nat_setup_info+0x221/0x58a
EAX: 00120fbd EBX: 00120fbd ECX: 00000001 EDX: 00000000
ESI: 0000019e EDI: e853bbb4 EBP: e853bbc8 ESP: e853bb78
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process conntrackd (pid: 2795, ti=e853a000 task=f7de10f0 task.ti=e853a000)
Stack: 00000000 e853bc2c e85672ec 00000008 c0561084 63c1db4a 00000000 00000000
00000000 0002e109 61d2b1c3 00000000 00000000 00000000 01114e22 61d2b1c3
00000000 00000000 f7444674 e853bc04 00000008 c038e728 0000000a f7444674
Call Trace:
[<c038e728>] nla_parse+0x5c/0xb0
[<c0397c1b>] ctnetlink_change_status+0x190/0x1c6
[<c0397eec>] ctnetlink_new_conntrack+0x189/0x61f
[<c0119aee>] update_curr+0x3d/0x52
[<c03902d1>] nfnetlink_rcv_msg+0xc1/0xd8
[<c0390228>] nfnetlink_rcv_msg+0x18/0xd8
[<c0390210>] nfnetlink_rcv_msg+0x0/0xd8
[<c038d2ce>] netlink_rcv_skb+0x2d/0x71
[<c0390205>] nfnetlink_rcv+0x19/0x24
[<c038d0f5>] netlink_unicast+0x1b3/0x216
...

Move invocation of the extension destructors to nf_conntrack_free()
to fix this problem.

Fixes http://bugzilla.kernel.org/show_bug.cgi?id=10875

Reported-and-Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 718d4ad9 09-Jun-2008 Fabian Hugelshofer <hugelshofer2006@gmx.ch>

netfilter: nf_conntrack: properly account terminating packets

Currently the last packet of a connection isn't accounted when its causing
abnormal termination.

Introduces nf_ct_kill_acct() which increments the accounting counters on
conntrack kill. The new function was necessary, because there are calls
to nf_ct_kill() which don't need accounting:

nf_conntrack_proto_tcp.c line ~847:
Kills ct and returns NF_REPEAT. We don't want to count twice.

nf_conntrack_proto_tcp.c line ~880:
Kills ct and returns NF_DROP. I think we don't want to count dropped
packets.

nf_conntrack_netlink.c line ~824:
As far as I can see ctnetlink_del_conntrack() is used to destroy a
conntrack on behalf of the user. There is an sk_buff, but I don't think
this is an actual packet. Incrementing counters here is therefore not
desired.

Signed-off-by: Fabian Hugelshofer <hugelshofer2006@gmx.ch>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 51091764 09-Jun-2008 Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: add nf_ct_kill()

Encapsulate the common

if (del_timer(&ct->timeout))
ct->timeout.function((unsigned long)ct)

sequence in a new function.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 443a70d5 29-Apr-2008 Philip Craig <philipc@snapgear.com>

netfilter: nf_conntrack: padding breaks conntrack hash on ARM

commit 0794935e "[NETFILTER]: nf_conntrack: optimize hash_conntrack()"
results in ARM platforms hashing uninitialised padding. This padding
doesn't exist on other architectures.

Fix this by replacing NF_CT_TUPLE_U_BLANK() with memset() to ensure
everything is initialised. There were only 4 bytes that
NF_CT_TUPLE_U_BLANK() wasn't clearing anyway (or 12 bytes on ARM).

Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ef1a5a50 14-Apr-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: fix incorrect check for expectations

The expectation classes changed help->expectations to an array,
fix use as scalar value.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 3c9fba65 14-Apr-2008 Jan Engelhardt <jengelh@computergmbh.de>

[NETFILTER]: nf_conntrack: replace NF_CT_DUMP_TUPLE macro indrection by function call

Directly call IPv4 and IPv6 variants where the address family is
easily known.

Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 5f2b4c90 14-Apr-2008 Jan Engelhardt <jengelh@computergmbh.de>

[NETFILTER]: nf_conntrack: use bool type in struct nf_conntrack_tuple.h

Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 5e8fbe2a 14-Apr-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: add tuplehash l3num/protonum accessors

Add accessors for l3num and protonum and get rid of some overly long
expressions.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 4e29e9ec 27-Feb-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: fix smp_processor_id() in preemptible code warning

Since we're using RCU for the conntrack hash now, we need to avoid
getting preempted or interrupted by BHs while changing the stats.

Fixes warning reported by Tilman Schmidt <tilman@imap.cc> when using
preemptible RCU:

[ 48.180297] BUG: using smp_processor_id() in preemptible [00000000] code: ntpdate/3562
[ 48.180297] caller is __nf_conntrack_find+0x9b/0xeb [nf_conntrack]
[ 48.180297] Pid: 3562, comm: ntpdate Not tainted 2.6.25-rc2-mm1-testing #1
[ 48.180297] [<c02015b9>] debug_smp_processor_id+0x99/0xb0
[ 48.180297] [<fac643a7>] __nf_conntrack_find+0x9b/0xeb [nf_conntrack]

Tested-by: Tilman Schmidt <tilman@imap.cc>
Tested-by: Christian Casteyde <casteyde.christian@free.fr> [Bugzilla #10097]

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d33b7c06 31-Jan-2008 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

[NETFILTER]: nf_conntrack: kill unused static inline (do_iter)

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c88130bc 31-Jan-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: naming unification

Rename all "conntrack" variables to "ct" for more consistency and
avoiding some overly long lines.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 76eb9460 31-Jan-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: don't inline early_drop()

early_drop() is only called *very* rarely, unfortunately gcc inlines it
into the hotpath because there is only a single caller. Explicitly mark
it noinline.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0794935e 31-Jan-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: optimize hash_conntrack()

Avoid calling jhash three times and hash the entire tuple in one go.

__hash_conntrack | -485 # 760 -> 275, # inlines: 3 -> 1, size inlines: 717 -> 252
1 function changed, 485 bytes removed

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ba419aff 31-Jan-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: optimize __nf_conntrack_find()

Ignoring specific entries in __nf_conntrack_find() is only needed by NAT
for nf_conntrack_tuple_taken(). Remove it from __nf_conntrack_find()
and make nf_conntrack_tuple_taken() search the hash itself.

Saves 54 bytes of text in the hotpath on x86_64:

__nf_conntrack_find | -54 # 321 -> 267, # inlines: 3 -> 2, size inlines: 181 -> 127
nf_conntrack_tuple_taken | +305 # 15 -> 320, lexblocks: 0 -> 3, # inlines: 0 -> 3, size inlines: 0 -> 181
nf_conntrack_find_get | -2 # 90 -> 88
3 functions changed, 305 bytes added, 56 bytes removed, diff: +249

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f8ba1aff 31-Jan-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: switch rwlock to spinlock

With the RCU conversion only write_lock usages of nf_conntrack_lock are
left (except one read_lock that should actually use write_lock in the
H.323 helper). Switch to a spinlock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 76507f69 31-Jan-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: use RCU for conntrack hash

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c52fbb41 31-Jan-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack_core: avoid taking nf_conntrack_lock in nf_conntrack_alter_reply

The conntrack is unconfirmed, so we have an exclusive reference, which
means that the write_lock is definitely unneeded. A read_lock used to
be needed for the helper lookup, but since we're using RCU for helpers
now rcu_read_lock is enough.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 47d95045 31-Jan-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: fix accounting with fixed timeouts

Don't skip accounting for conntracks with the FIXED_TIMEOUT bit.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 96eb24d7 31-Jan-2008 Stephen Hemminger <shemminger@vyatta.com>

[NETFILTER]: nf_conntrack: sparse warnings

The hashtable size is really unsigned so sparse complains when you pass
a signed integer. Change all uses to make it consistent.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b334aadc 15-Jan-2008 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: clean up a few header files

- Remove declarations of non-existing variables and functions
- Move helper init/cleanup function declarations to nf_conntrack_helper.h
- Remove unneeded __nf_conntrack_attach declaration and make it static

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 34498825 17-Dec-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: non-power-of-two jhash optimizations

Apply Eric Dumazet's jhash optimizations where applicable. Quoting Eric:

Thanks to jhash, hash value uses full 32 bits. Instead of returning
hash % size (implying a divide) we return the high 32 bits of the
(hash * size) that will give results between [0 and size-1] and same
hash distribution.

On most cpus, a multiply is less expensive than a divide, by an order
of magnitude.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 77236b6e 17-Dec-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: ctnetlink: use netlink attribute helpers

Use NLA_PUT_BE32, nla_get_be32() etc.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fae718dd 24-Dec-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack_ipv4: fix module parameter compatibility

Some users do "modprobe ip_conntrack hashsize=...". Since we have the
module aliases this loads nf_conntrack_ipv4 and nf_conntrack, the
hashsize parameter is unknown for nf_conntrack_ipv4 however and makes
it fail.

Allow to specify hashsize= for both nf_conntrack and nf_conntrack_ipv4.

Note: the nf_conntrack message in the ringbuffer will display an
incorrect hashsize since nf_conntrack is first pulled in as a
dependency and calculates the size itself, then it gets changed
through a call to nf_conntrack_set_hashsize().

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 29b67497 29-Oct-2007 Andrew Morton <akpm@linux-foundation.org>

[NETFILTER]: nf_ct_alloc_hashtable(): use __GFP_NOWARN

This allocation is expected to fail and we handle it by fallback to vmalloc().

So don't scare people with nasty messages like
http://bugzilla.kernel.org/show_bug.cgi?id=9190

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3db05fea 15-Oct-2007 Herbert Xu <herbert@gondor.apana.org.au>

[NETFILTER]: Replace sk_buff ** with sk_buff *

With all the users of the double pointers removed, this patch mops up by
finally replacing all occurances of sk_buff ** in the netfilter API by
sk_buff *.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7f85f914 28-Sep-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: kill unique ID

Remove the per-conntrack ID, its not necessary anymore for dumping.
For compatiblity reasons we send the address of the conntrack to
userspace as ID.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f73e924c 28-Sep-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: ctnetlink: use netlink policy

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fdf70832 28-Sep-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nfnetlink: rename functions containing 'nfattr'

There is no struct nfattr anymore, rename functions to 'nlattr'.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# df6fb868 28-Sep-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nfnetlink: convert to generic netlink attribute functions

Get rid of the duplicated rtnetlink macros and use the generic netlink
attribute functions. The old duplicated stuff is moved to a new header
file that exists just for userspace.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a34c4589 26-Jul-2007 Al Viro <viro@ftp.linux.org.uk>

netfilter endian regressions

no real bugs, just misannotations cropping up

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 20c2df83 19-Jul-2007 Paul Mundt <lethal@linux-sh.org>

mm: Remove slab destructors from kmem_cache_create().

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>


# e2a3123f 14-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: Introduces nf_ct_get_tuplepr and uses it

nf_ct_get_tuple() requires the offset to transport header and that bothers
callers such as icmp[v6] l4proto modules. This introduces new function
to simplify them.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ffc30690 14-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: make l3proto->prepare() generic and renames it

The icmp[v6] l4proto modules parse headers in ICMP[v6] error to get tuple.
But they have to find the offset to transport protocol header before that.
Their processings are almost same as prepare() of l3proto modules.
This makes prepare() more generic to simplify icmp[v6] l4proto module
later.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d87d8469 14-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: Increment error count on parsing IPv4 header

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0d53778e 07-Jul-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Convert DEBUGP to pr_debug

Convert DEBUGP to pr_debug and fix lots of non-compiling debug statements.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7ae7730f 07-Jul-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: early_drop improvement

When the maximum number of conntrack entries is reached and a new
one needs to be allocated, conntrack tries to drop an unassured
connection from the same hash bucket the new conntrack would hash
to. Since with a properly sized hash the average number of entries
per bucket is 1, the chances of actually finding one are not very
good. This patch makes it walk the hash until a minimum number of
8 entries are checked.

Based on patch by Vasily Averin <vvs@sw.ru>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b560580a 07-Jul-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack_expect: maintain per conntrack expectation list

This patch brings back the per-conntrack expectation list that was
removed around 2.6.10 to avoid walking all expectations on expectation
eviction and conntrack destruction.

As these were the last users of the global expectation list, this patch
also kills that.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e9c1b084 07-Jul-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: move expectaton related init code to nf_conntrack_expect.c

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6823645d 07-Jul-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack_expect: function naming unification

Currently there is a wild mix of nf_conntrack_expect_, nf_ct_exp_,
expect_, exp_, ...

Consistently use nf_ct_ as prefix for exported functions.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ac565e5f 07-Jul-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: export hash allocation/destruction functions

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 330f7db5 07-Jul-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: remove 'ignore_conntrack' argument from nf_conntrack_find_get

All callers pass NULL, this also doesn't seem very useful for modules.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f205c5e0 07-Jul-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: use hlists for conntrack hash

Convert conntrack hash to hlists to reduce its size and cache
footprint. Since the default hashsize to max. entries ratio
sucks (1:16), this patch doesn't reduce the amount of memory
used for the hash by default, but instead uses a better ratio
of 1:8, which results in the same max. entries value.

One thing worth noting is early_drop. It really should use LRU,
so it now has to iterate over the entire chain to find the last
unconfirmed entry. Since chains shouldn't be very long and the
entire operation is very rare this shouldn't be a problem.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8e5105a0 07-Jul-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: round up hashsize to next multiple of PAGE_SIZE

Don't let the rest of the page go to waste.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d8a0509a 07-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_nat: kill global 'destroy' operation

This kills the global 'destroy' operation which was used by NAT.
Instead it uses the extension infrastructure so that multiple
extensions can register own operations.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dacd2a1a 07-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: remove old memory allocator of conntrack

Now memory space for help and NAT are allocated by extension
infrastructure.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ceceae1b 07-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: use extension infrastructure for helper

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ecfab2c9 07-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: introduce extension infrastructure

Old space allocator of conntrack had problems about extensibility.
- It required slab cache per combination of extensions.
- It expected what extensions would be assigned, but it was impossible
to expect that completely, then we allocated bigger memory object than
really required.
- It needed to search helper twice due to lock issue.

Now basic informations of a connection are stored in 'struct nf_conn'.
And a storage for extension (helper, NAT) is allocated by kmalloc.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3c158f7f 05-Jun-2007 Patrick McHarrdy <kaber@trash.net>

[NETFILTER]: nf_conntrack: fix helper module unload races

When a helper module is unloaded all conntracks refering to it have their
helper pointer NULLed out, leading to lots of races. In most places this
can be fixed by proper use of RCU (they do already check for != NULL,
but in a racy way), additionally nf_conntrack_expect_related needs to
bail out when no helper is present.

Also remove two paranoid BUG_ONs in nf_conntrack_proto_gre that are racy
and not worth fixing.

Signed-off-by: Patrick McHarrdy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5397e97d 19-May-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: fix use-after-free in helper destroy callback invocation

When the helper module is removed for a master connection that has a
fulfilled expectation, but has already timed out and got removed from
the hash tables, nf_conntrack_helper_unregister can't find the master
connection to unset the helper, causing a use-after-free when the
expected connection is destroyed and releases the last reference to
the master.

The helper destroy callback was introduced for the PPtP helper to clean
up expectations and expected connections when the master connection
times out, but doing this from destroy_conntrack only works for
unfulfilled expectations since expected connections hold a reference
to the master, preventing its destruction. Move the destroy callback to
the timeout function, which fixes both problems.

Reported/tested by Gabor Burjan <buga@buvoshetes.hu>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5d78a849 10-May-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_nat: Clears helper private area when NATing

Some helpers (eg. ftp) assume that private area in conntrack is
filled with zero. It should be cleared when helper is changed.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fda61436 10-May-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: Removes unused destroy operation of l3proto

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# de6e05c4 23-Mar-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: kill destroy() in struct nf_conntrack for diet

The destructor per conntrack is unnecessary, then this replaces it with
system wide destructor.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e6f689db 23-Mar-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Use setup_timer

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1b53d904 23-Mar-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Remove changelogs and CVS IDs

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9b887909 14-Mar-2007 Sami Farin <safari-netfilter@safari.iki.fi>

[NETFILTER]: nf_conntrack: use jhash2 in __hash_conntrack

Now it uses jhash, but using jhash2 would be around 3-4 times faster
(on P4).

Signed-off-by: Sami Farin <safari-netfilter@safari.iki.fi>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ac5357eb 14-Mar-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: remove ugly hack in l4proto registration

Remove ugly special-casing of nf_conntrack_l4proto_generic, all it
wants is its sysctl tables registered, so do that explicitly in an
init function and move the remaining protocol initialization and
cleanup code to nf_conntrack_proto.c as well.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bbe735e4 10-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com>

[SK_BUFF]: Introduce skb_network_offset()

For the quite common 'skb->nh.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e281db5c 04-Mar-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack/nf_nat: fix incorrect config ifdefs

The nf_conntrack_netlink config option is named CONFIG_NF_CT_NETLINK,
but multiple files use CONFIG_IP_NF_CONNTRACK_NETLINK or
CONFIG_NF_CONNTRACK_NETLINK for ifdefs.

Fix this and reformat all CONFIG_NF_CT_NETLINK ifdefs to only use a line.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec68e97d 04-Mar-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: conntrack: fix {nf,ip}_ct_iterate_cleanup endless loops

Fix {nf,ip}_ct_iterate_cleanup unconfirmed list handling:

- unconfirmed entries can not be killed manually, they are removed on
confirmation or final destruction of the conntrack entry, which means
we might iterate forever without making forward progress.

This can happen in combination with the conntrack event cache, which
holds a reference to the conntrack entry, which is only released when
the packet makes it all the way through the stack or a different
packet is handled.

- taking references to an unconfirmed entry and using it outside the
locked section doesn't work, the list entries are not refcounted and
another CPU might already be waiting to destroy the entry

What the code really wants to do is make sure the references of the hash
table to the selected conntrack entries are released, so they will be
destroyed once all references from skbs and the event cache are dropped.

Since unconfirmed entries haven't even entered the hash yet, simply mark
them as dying and skip confirmation based on that.

Reported and tested by Chuck Ebbert <cebbert@redhat.com>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 601e68e1 12-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

[NETFILTER]: Fix whitespace errors

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 982d9a9c 12-Feb-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: properly use RCU for nf_conntrack_destroyed callback

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c0e912d7 12-Feb-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: fix invalid conntrack statistics RCU assumption

NF_CT_STAT_INC assumes rcu_read_lock in nf_hook_slow disables
preemption as well, making it legal to use __get_cpu_var without
disabling preemption manually. The assumption is not correct anymore
with preemptable RCU, additionally we need to protect against softirqs
when not holding nf_conntrack_lock.

Add NF_CT_STAT_INC_ATOMIC macro, which disables local softirqs,
and use where necessary.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 923f4902 12-Feb-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: properly use RCU API for nf_ct_protos/nf_ct_l3protos arrays

Replace preempt_{enable,disable} based RCU by proper use of the
RCU API and add missing rcu_read_lock/rcu_read_unlock calls in
all paths not obviously only used within packet process context
(nfnetlink_conntrack).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c3a47ab3 12-Feb-2007 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Properly use RCU in nf_ct_attach

Use rcu_assign_pointer/rcu_dereference for ip_ct_attach pointer instead
of self-made RCU and use rcu_read_lock to make sure the conntrack module
doesn't disappear below us while calling it, since this function can be
called from outside the netfilter hooks.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e18b890b 06-Dec-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] slab: remove kmem_cache_t

Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#

set -e

for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done

The script was run like this

sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 272491ef 07-Dec-2006 Randy Dunlap <randy.dunlap@oracle.com>

[NETFILTER]: Fix non-ANSI func. decl.

Fix non-ANSI function declaration:
net/netfilter/nf_conntrack_core.c:1096:25: warning: non-ANSI function declaration of function 'nf_conntrack_flush'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d7fe0f24 03-Dec-2006 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] severing skbuff.h -> mm.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 13b18339 02-Dec-2006 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: EXPORT_SYMBOL cleanup

- move EXPORT_SYMBOL next to exported symbol
- use EXPORT_SYMBOL_GPL since this is what the original code used

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f09943fe 02-Dec-2006 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack/nf_nat: add PPTP helper port

Add nf_conntrack port of the PPtP conntrack/NAT helper. Since there seems
to be no IPv6-capable PPtP implementation the helper only support IPv4.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5b1158e9 02-Dec-2006 Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

[NETFILTER]: Add NAT support for nf_conntrack

Add NAT support for nf_conntrack. Joint work of Jozsef Kadlecsik,
Yasuyuki Kozakai, Martin Josefsson and myself.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9457d851 02-Dec-2006 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: automatic helper assignment for expectations

Some helpers (namely H.323) manually assign further helpers to expected
connections. This is not possible with nf_conntrack anymore since we
need to know whether a helper is used at allocation time.

Handle the helper assignment centrally, which allows to perform the
correct allocation and as a nice side effect eliminates the need
for the H.323 helper to fiddle with nf_conntrack_lock.

Mid term the allocation scheme really needs to be redesigned since
we do both the helper and expectation lookup _twice_ for every new
connection.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bff9a89b 02-Dec-2006 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: endian annotations

Resync with Al Viro's ip_conntrack annotations and fix a missed
spot in ip_nat_proto_icmp.c.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a999e683 28-Nov-2006 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: sysctl compatibility with old connection tracking

This patch adds an option to keep the connection tracking sysctls visible
under their old names.

Signed-off-by: Patrick McHardy <kaber@trash.net>


# 933a41e7 28-Nov-2006 Patrick McHardy <kaber@trash.net>

[NETFILTER]: nf_conntrack: move conntrack protocol sysctls to individual modules

Signed-off-by: Patrick McHardy <kaber@trash.net>


# be00c8e4 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: reduce timer updates in __nf_ct_refresh_acct()

Only update the conntrack timer if there's been at least HZ jiffies since
the last update. Reduces the number of del_timer/add_timer cycles from one
per packet to one per connection per second (plus once for each state change
of a connection)

Should handle timer wraparounds and connection timeout changes.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 3ffd5eeb 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: minor __nf_ct_refresh_acct() whitespace cleanup

Minor whitespace cleanup.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 951d36ca 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: remove ASSERT_{READ,WRITE}_LOCK

Remove the usage of ASSERT_READ_LOCK/ASSERT_WRITE_LOCK in nf_conntrack,
it didn't do anything, it was just an empty define and it uglified the code.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# ae5718fb 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: more sanity checks in protocol registration/unregistration

Add some more sanity checks when registering/unregistering l3/l4 protocols.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 605dcad6 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: rename struct nf_conntrack_protocol

Rename 'struct nf_conntrack_protocol' to 'struct nf_conntrack_l4proto' in
order to help distinguish it from 'struct nf_conntrack_l3proto'. It gets
rather confusing with 'nf_conntrack_protocol'.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# e2b7606c 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: More __read_mostly annotations

Place rarely written variables in the read-mostly section by using
__read_mostly

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 8f03dea5 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: split out protocol handling

This patch splits out L3/L4 protocol handling into its own file
nf_conntrack_proto.c

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# f6180121 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: split out the event cache

This patch splits out the event cache into its own file
nf_conntrack_ecache.c

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 7e5d03bb 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: split out helper handling

This patch splits out handling of helpers into its own file
nf_conntrack_helper.c

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 77ab9cff 28-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: split out expectation handling

This patch splits out expectation handling into its own file
nf_conntrack_expect.c

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>


# 2e47c264 27-Nov-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: conntrack: fix refcount leak when finding expectation

All users of __{ip,nf}_conntrack_expect_find() don't expect that
it increments the reference count of expectation.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 22e7410b 27-Nov-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: fix the race on assign helper to new conntrack

The found helper cannot be assigned to conntrack after unlocking
nf_conntrack_lock. This tries to find helper to assign again.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c073e3fa 30-Oct-2006 Martin Josefsson <gandalf@wlug.westbo.se>

[NETFILTER]: nf_conntrack: add missing unlock in get_next_corpse()

Add missing unlock in get_next_corpse() in nf_conntrack. It was missed
during the removal of listhelp.h . Also remove an unneeded use of
nf_ct_tuplehash_to_ctrack() in the same function.

Should be applied before 2.6.19 is released.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1192e403 20-Sep-2006 Brian Haley <brian.haley@hp.com>

[NETFILTER]: make some netfilter globals __read_mostly

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5251e2d2 20-Sep-2006 Pablo Neira Ayuso <pablo@netfilter.org>

[NETFILTER]: conntrack: fix race condition in early_drop

On SMP environments the maximum number of conntracks can be overpassed
under heavy stress situations due to an existing race condition.

CPU A CPU B
atomic_read() ...
early_drop() ...
... atomic_read()
allocate conntrack allocate conntrack
atomic_inc() atomic_inc()

This patch moves the counter incrementation before the early drop stage.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# df0933dc 20-Sep-2006 Patrick McHardy <kaber@trash.net>

[NETFILTER]: kill listhelp.h

Kill listhelp.h and use the list.h functions instead.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 94aec08e 18-Sep-2006 Brian Haley <brian.haley@hp.com>

[NETFILTER]: Change tunables to __read_mostly

Change some netfilter tunables to __read_mostly. Also fixed some
incorrect file reference comments while I was in there.

(this will be my last __read_mostly patch unless someone points out
something else that needs it)

Signed-off-by: Brian Haley <brian.haley@hp.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 7c9728c3 09-Jun-2006 James Morris <jmorris@namei.org>

[SECMARK]: Add secmark support to conntrack

Add a secmark field to IP and NF conntracks, so that security markings
on packets can be copied to their associated connections, and also
copied back to packets as required. This is similar to the network
mark field currently used with conntrack, although it is intended for
enforcement of security policy rather than network policy.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 997ae831 29-May-2006 Eric Leblond <eric@inl.fr>

[NETFILTER]: conntrack: add fixed timeout flag in connection tracking

Add a flag in a connection status to have a non updated timeout.
This permits to have connection that automatically die at a given
time.

Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2c16b774 24-Apr-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: kill unused callback init_conntrack

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e1bbdebd 24-Apr-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: Fix module refcount dropping too far

If nf_ct_l3proto_find_get() fails to get the refcount of
nf_ct_l3proto_generic, nf_ct_l3proto_put() will drop the refcount
too far.

This gets rid of '.me = THIS_MODULE' of nf_ct_l3proto_generic so that
nf_ct_l3proto_find_get() doesn't try to get refcount of it.
It's OK because its symbol is usable until nf_conntrack.ko is unloaded.

This also kills unnecessary NULL pointer check as well.
__nf_ct_proto_find() allways returns non-NULL pointer.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6f912042 10-Apr-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[PATCH] for_each_possible_cpu: network codes

for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.

We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.

This patch replaces for_each_cpu with for_each_possible_cpu under /net

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e041c683 27-Mar-2006 Alan Stern <stern@rowland.harvard.edu>

[PATCH] Notifier chain update: API changes

The kernel's implementation of notifier chains is unsafe. There is no
protection against entries being added to or removed from a chain while the
chain is in use. The issues were discussed in this thread:

http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2

We noticed that notifier chains in the kernel fall into two basic usage
classes:

"Blocking" chains are always called from a process context
and the callout routines are allowed to sleep;

"Atomic" chains can be called from an atomic context and
the callout routines are not allowed to sleep.

We decided to codify this distinction and make it part of the API. Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name). New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain. The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.

With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed. For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections. (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)

There are some limitations, which should not be too hard to live with. For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem. Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain. (This did happen in a couple of places and the code
had to be changed to avoid it.)

Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization. Instead we use RCU. The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.

Here is the list of chains that we adjusted and their classifications. None
of them use the raw API, so for the moment it is only a placeholder.

ATOMIC CHAINS
-------------
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain

BLOCKING CHAINS
---------------
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
arch/x86_64/kernel/process.c idle_notifier
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c sleep_notifier_list
drivers/macintosh/via-pmu68k.c sleep_notifier_list
drivers/macintosh/windfarm_core.c wf_client_list
drivers/usb/core/notify.c usb_notifier_list
drivers/video/fbmem.c fb_notifier_list
kernel/cpu.c cpu_chain
kernel/module.c module_notify_list
kernel/profile.c munmap_notifier
kernel/profile.c task_exit_notifier
kernel/sys.c reboot_notifier_list
net/core/dev.c netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain

It's possible that some of these classifications are wrong. If they are,
please let us know or submit a patch to fix them. Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)

The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.

[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b9f78f9f 22-Mar-2006 Pablo Neira Ayuso <pablo@netfilter.org>

[NETFILTER]: nf_conntrack: support for layer 3 protocol load on demand

x_tables matches and targets that require nf_conntrack_ipv[4|6] to work
don't have enough information to load on demand these modules. This
patch introduces the following changes to solve this issue:

o nf_ct_l3proto_try_module_get: try to load the layer 3 connection
tracker module and increases the refcount.
o nf_ct_l3proto_module put: drop the refcount of the module.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4e3882f7 22-Mar-2006 Pablo Neira Ayuso <pablo@netfilter.org>

[NETFILTER]: conntrack: cleanup the conntrack ID initialization

Currently the first conntrack ID assigned is 2, use 1 instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 57b47a53 20-Mar-2006 Ingo Molnar <mingo@elte.hu>

[NET]: sem2mutex part 2

Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dc808fe2 20-Mar-2006 Harald Welte <laforge@netfilter.org>

[NETFILTER] nf_conntrack: clean up to reduce size of 'struct nf_conn'

This patch moves all helper related data fields of 'struct nf_conn'
into a separate structure 'struct nf_conn_help'. This new structure
is only present in conntrack entries for which we actually have a
helper loaded.

Also, this patch cleans up the nf_conntrack 'features' mechanism to
resemble what the original idea was: Just glue the feature-specific
data structures at the end of 'struct nf_conn', and explicitly
re-calculate the pointer to it when needed rather than keeping
pointers around.

Saves 20 bytes per conntrack on my x86_64 box. A non-helped conntrack
is 276 bytes. We still need to save another 20 bytes in order to fit
into to target of 256bytes.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7d3cdc6b 15-Feb-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: move registration of __nf_ct_attach

Move registration of __nf_ct_attach to nf_conntrack_core to make it usable
for IPv6 connection tracking as well.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ddc8d029 04-Feb-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: nf_conntrack: check address family when finding protocol module

__nf_conntrack_{l3}proto_find() doesn't check the passed protocol family,
then it's possible to touch out of the array which has only AF_MAX items.

Spotted by Pablo Neira Ayuso.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c1d10adb 05-Jan-2006 Pablo Neira Ayuso <pablo@netfilter.org>

[NETFILTER]: Add ctnetlink port for nf_conntrack

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d695aa8a 05-Jan-2006 Jesper Juhl <jesper.juhl@gmail.com>

[NETFILTER]: Decrease number of pointer derefs in nf_conntrack_core.c

Benefits of the patch:
- Fewer pointer dereferences should make the code slightly faster.
- Size of generated code is smaller
- improved readability

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6636568c 05-Dec-2005 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Wait for untracked references in nf_conntrack module unload

Noticed by Pablo Neira <pablo@eurodev.net>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4a59a810 17-Nov-2005 Patrick McHardy <kaber@trash.net>

[NETFILTER]: Fix nf_conntrack compilation with CONFIG_NETFILTER_DEBUG

CC [M] net/netfilter/nf_conntrack_core.o
net/netfilter/nf_conntrack_core.c: In function 'nf_ct_unlink_expect':
net/netfilter/nf_conntrack_core.c:390: error: 'exp_timeout' undeclared (first use in this function)
net/netfilter/nf_conntrack_core.c:390: error: (Each undeclared identifier is reported only once
net/netfilter/nf_conntrack_core.c:390: error: for each function it appears in.)

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a6f294e4 15-Nov-2005 KOVACS Krisztian <hidden@balabit.hu>

[NETFILTER] Free layer-3 specific protocol tables at cleanup

Although the comment around the allocation code tells us that
the layer-3 specific protocol tables will be freed when cleaning up,
they aren't. And this makes nfsim complain loudly...

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9fb9cbb1 09-Nov-2005 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>

[NETFILTER]: Add nf_conntrack subsystem.

The existing connection tracking subsystem in netfilter can only
handle ipv4. There were basically two choices present to add
connection tracking support for ipv6. We could either duplicate all
of the ipv4 connection tracking code into an ipv6 counterpart, or (the
choice taken by these patches) we could design a generic layer that
could handle both ipv4 and ipv6 and thus requiring only one sub-protocol
(TCP, UDP, etc.) connection tracking helper module to be written.

In fact nf_conntrack is capable of working with any layer 3
protocol.

The existing ipv4 specific conntrack code could also not deal
with the pecularities of doing connection tracking on ipv6,
which is also cured here. For example, these issues include:

1) ICMPv6 handling, which is used for neighbour discovery in
ipv6 thus some messages such as these should not participate
in connection tracking since effectively they are like ARP
messages

2) fragmentation must be handled differently in ipv6, because
the simplistic "defrag, connection track and NAT, refrag"
(which the existing ipv4 connection tracking does) approach simply
isn't feasible in ipv6

3) ipv6 extension header parsing must occur at the correct spots
before and after connection tracking decisions, and there were
no provisions for this in the existing connection tracking
design

4) ipv6 has no need for stateful NAT

The ipv4 specific conntrack layer is kept around, until all of
the ipv4 specific conntrack helpers are ported over to nf_conntrack
and it is feature complete. Once that occurs, the old conntrack
stuff will get placed into the feature-removal-schedule and we will
fully kill it off 6 months later.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>