History log of /freebsd-current/sys/netinet6/ip6_input.c
Revision Date Author Comments
# 60d8dbbe 18-Jan-2024 Kristof Provost <kp@FreeBSD.org>

netinet: add a probe point for IP, IP6, ICMP, ICMP6, UDP and TCP stats counters

When debugging network issues one common clue is an unexpectedly
incrementing error counter. This is helpful, in that it gives us an
idea of what might be going wrong, but often these counters may be
incremented in different functions.

Add a static probe point for them so that we can use dtrace to get
futher information (e.g. a stack trace).

For example:
dtrace -n 'mib:ip:count: { printf("%d", arg0); stack(); }'

This can be disabled by setting the following kernel option:
options KDTRACE_NO_MIB_SDT

Reviewed by: gallatin, tuexen (previous version), gnn (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D43504


# ffeab76b 26-Jan-2024 Kristof Provost <kp@FreeBSD.org>

pfil: PFIL_PASS never frees the mbuf

pfil hooks (i.e. firewalls) may pass, modify or free the mbuf passed
to them. (E.g. when rejecting a packet, or when gathering up packets
for reassembly).

If the hook returns PFIL_PASS the mbuf must still be present. Assert
this in pfil_mem_common() and ensure that ipfilter follows this
convention. pf and ipfw already did.
Similarly, if the hook returns PFIL_DROPPED or PFIL_CONSUMED the mbuf
must have been freed (or now be owned by the firewall for further
processing, like packet scheduling or reassembly).

This allows us to remove a few extraneous NULL checks.

Suggested by: tuexen
Reviewed by: tuexen, zlei
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D43617


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 5ab15157 24-May-2023 Doug Rabson <dfr@FreeBSD.org>

netinet*: Fix redirects for connections from localhost

Redirect rules use PFIL_IN and PFIL_OUT events to allow packet filter
rules to change the destination address and port for a connection.
Typically, the rule triggers on an input event when a packet is received
by a router and the destination address and/or port is changed to
implement the redirect. When a reply packet on this connection is output
to the network, the rule triggers again, reversing the modification.

When the connection is initiated on the same host as the packet filter,
it is initially output via lo0 which queues it for input processing.
This causes an input event on the lo0 interface, allowing redirect
processing to rewrite the destination and create state for the
connection. However, when the reply is received, no corresponding output
event is generated; instead, the packet is delivered to the higher level
protocol (e.g. tcp or udp) without reversing the redirect, the reply is
not matched to the connection and the packet is dropped (for tcp, a
connection reset is also sent).

This commit fixes the problem by adding a second packet filter call in
the input path. The second call happens right before the handoff to
higher level processing and provides the missing output event to allow
the redirect's reply processing to perform its rewrite. This extra
processing is disabled by default and can be enabled using pfilctl:

pfilctl link -o pf:default-out inet-local
pfilctl link -o pf:default-out6 inet6-local

PR: 268717
Reviewed-by: kp, melifaro
MFC-after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40256


# bb55bb17 06-Mar-2023 Justin Hibbits <jhibbits@FreeBSD.org>

inet6: Include if_private.h in one more netstack file

ip6_input() and ip6_destroy() both directly reference ifnet members.
This file was missed in 3d0d5b21

Fixes: 3d0d5b21 ("IfAPI: Explicitly include <net/if_private.h>...")
Sponsored by: Juniper Networks, Inc.


# fcb3f813 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet*: remove PRC_ constants and streamline ICMP processing

In the original design of the network stack from the protocol control
input method pr_ctlinput was used notify the protocols about two very
different kinds of events: internal system events and receival of an
ICMP messages from outside. These events were coded with PRC_ codes.
Today these methods are removed from the protosw(9) and are isolated
to IPv4 and IPv6 stacks and are called only from icmp*_input(). The
PRC_ codes now just create a shim layer between ICMP codes and errors
or actions taken by protocols.

- Change ipproto_ctlinput_t to pass just pointer to ICMP header. This
allows protocols to not deduct it from the internal IP header.
- Change ip6proto_ctlinput_t to pass just struct ip6ctlparam pointer.
It has all the information needed to the protocols. In the structure,
change ip6c_finaldst fields to sockaddr_in6. The reason is that
icmp6_input() already has this address wrapped in sockaddr, and the
protocols want this address as sockaddr.
- For UDP tunneling control input, as well as for IPSEC control input,
change the prototypes to accept a transparent union of either ICMP
header pointer or struct ip6ctlparam pointer.
- In icmp_input() and icmp6_input() do only validation of ICMP header and
count bad packets. The translation of ICMP codes to errors/actions is
done by protocols.
- Provide icmp_errmap() and icmp6_errmap() as substitute to inetctlerrmap,
inet6ctlerrmap arrays.
- In protocol ctlinput methods either trust what icmp_errmap() recommend,
or do our own logic based on the ICMP header.

Differential revision: https://reviews.freebsd.org/D36731


# 53807a8a 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet*: use sparse C99 initializer for inetctlerrmap

and mark those PRC_* codes, that are used. The rest are dead code.
This is not a functional change, but illustrative to make easier
review of following changes.


# 46ddeb6b 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet6: retire ip6protosw.h

The netinet/ipprotosw.h and netinet6/ip6protosw.h were KAME relics, with
the former removed in f0ffb944d25 in 2001 and the latter survived until
today. It has been reduced down to only one useful declaration that
moves to ip6_var.h

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36726


# 24b96f35 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet*: move ipproto_register() and co to ip_var.h and ip6_var.h

This is a FreeBSD KPI and belongs to private header not netinet/in.h.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36723


# dda6376b 08-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

net: employ newly added pfil_mbuf_{in,out} where approriate

Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D36454


# 223a73a1 06-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

net: remove stale altq_input reference

Code setting it was removed in:
commit 325fab802e1f40c992141f945d0788c0edfdb1a4
Author: Eric van Gyzen <vangyzen@FreeBSD.org>
Date: Tue Dec 4 23:46:43 2018 +0000

altq: remove ALTQ3_COMPAT code

Reviewed by: glebius, kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D36471


# 6080e073 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

ip6_input: explicitly include <sys/eventhandler.h>

On most architectures/kernels it was included implicitly, but powerpc
MPC85XX got broken.

Fixes: 81a34d374ed6e5a7b14f24583bc8e3abfdc66306


# 81a34d37 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: retire pr_drain and use EVENTHANDLER(9) directly

The method was called for two different conditions: 1) the VM layer is
low on pages or 2) one of UMA zones of mbuf allocator exhausted.
This change 2) into a new event handler, but all affected network
subsystems modified to subscribe to both, so this change shall not
bring functional changes under different low memory situations.

There were three subsystems still using pr_drain: TCP, SCTP and frag6.
The latter had its protosw entry for the only reason to register its
pr_drain method.

Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36164


# 78b1fc05 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: separate pr_input and pr_ctlinput out of protosw

The protosw KPI historically has implemented two quite orthogonal
things: protocols that implement a certain kind of socket, and
protocols that are IPv4/IPv6 protocol. These two things do not
make one-to-one correspondence. The pr_input and pr_ctlinput methods
were utilized only in IP protocols. This strange duality required
IP protocols that doesn't have a socket to declare protosw, e.g.
carp(4). On the other hand developers of socket protocols thought
that they need to define pr_input/pr_ctlinput always, which lead to
strange dead code, e.g. div_input() or sdp_ctlinput().

With this change pr_input and pr_ctlinput as part of protosw disappear
and IPv4/IPv6 get their private single level protocol switch table
ip_protox[] and ip6_protox[] respectively, pointing at array of
ipproto_input_t functions. The pr_ctlinput that was used for
control input coming from the network (ICMP, ICMPv6) is now represented
by ip_ctlprotox[] and ip6_ctlprotox[].

ipproto_register() becomes the only official way to register in the
table. Those protocols that were always static and unlikely anybody
is interested in making them loadable, are now registered by ip_init(),
ip6_init(). An IP protocol that considers itself unloadable shall
register itself within its own private SYSINIT().

Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36157


# 50fa27e7 09-Jul-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

netinet6: fix interface handling for loopback traffic

Currently, processing of IPv6 local traffic is partially broken:
link-local connection fails and global unicast connect() takes
3 seconds to complete.
This happens due to the combination of multiple factors.
IPv6 code passes original interface "origifp" when passing
traffic via loopack to retain the scope that is mandatory for the
correct hadling of link-local traffic. First problem is that the logic
of passing source interface is not working correcly for TCP connections,
resulting in passing "origifp" on the first 2 connection attempts and
lo0 on the subsequent ones. Second problem is that source address
validation logic skips its checks iff the source interface is loopback,
which doesn't cover "origifp" case.
More detailed description is available at https://reviews.freebsd.org/D35732

Fix the first problem by untangling&simplifying ifp/origifp logic.
Fix the second problem by switching source address validation check to
using M_LOOP mbuf flag instead of interface type.

PR: 265089
Reviewed by: ae, bz(previous version)
Differential Revision: https://reviews.freebsd.org/D35732
MFC after: 2 weeks


# 0ed72537 04-Jul-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

netinet6: perform out-of-bounds check for loX multicast statistics

Currently, some per-mbuf multicast statistics is stored in
the per-interface ip6stat.ip6s_m2m[] array of size 32 (IP6S_M2MMAX).
Check that loopback ifindex falls within 0.. IP6S_M2MMAX-1 range to
avoid silent data corruption. The latter cat happen with large
number of VNETs.

Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D35715
MFC after: 2 weeks


# 6890b588 17-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockbuf: improve sbcreatecontrol()

o Constify memory pointer. Make length unsigned.
o Make it never fail with M_WAITOK and assert that length is sane.


# b46667c6 17-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockbuf: merge two versions of sbcreatecontrol() into one

No functional change.


# 89128ff3 03-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protocols: init with standard SYSINIT(9) or VNET_SYSINIT

The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.

Getting rid of pr_init allows to achieve several things:
o Get rid of ifdef's that protect against double foo_init() when
both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Makes code easier to understand and maintain.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D33537


# 1817be48 12-Nov-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Add net.inet6.ip6.source_address_validation

Drop packets arriving from the network that have our source IPv6
address. If maliciously crafted they can create evil effects
like an RST exchange between two of our listening TCP ports.
Such packets just can't be legitimate. Enable the tunable
by default. Long time due for a modern Internet host.

Reviewed by: melifaro, donner, kp
Differential revision: https://reviews.freebsd.org/D32915


# 7045b160 28-Jul-2021 Roy Marples <roy@marples.name>

socket: Implement SO_RERROR

SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.

This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.

Reviewed by: philip (network), kbowling (transport), gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26652


# b1d63265 08-Mar-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Flush remaining routes from the routing table during VNET shutdown.

Summary:
This fixes rtentry leak for the cloned interfaces created inside the
VNET.

PR: 253998
Reported by: rashey at superbox.pl
MFC after: 3 days

Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.

Test Plan:
```
set_skip:set_skip_group_lo -> passed [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```

Reviewers: #network, kp, bz

Reviewed By: kp

Subscribers: imp, ae

Differential Revision: https://reviews.freebsd.org/D29116


# 8268d82c 15-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove per-packet ifa refcounting from IPv6 fast path.

Currently ip6_input() calls in6ifa_ifwithaddr() for
every local packet, in order to check if the target ip
belongs to the local ifa in proper state and increase
its counters.

in6ifa_ifwithaddr() references found ifa.
With epoch changes, both `ip6_input()` and all other current callers
of `in6ifa_ifwithaddr()` do not need this reference
anymore, as epoch provides stability guarantee.

Given that, update `in6ifa_ifwithaddr()` to allow
it to return ifa without referencing it, while preserving
option for getting referenced ifa if so desired.

MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28648


# 924d1c9a 08-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
Wrong version of the change was pushed inadvertenly.

This reverts commit 4a01b854ca5c2e5124958363b3326708b913af71.


# 4a01b854 07-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 74ff87cd 06-Dec-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

Update comment.

Update the comment related to SIIT and v4mapped addresses being rejected
by us when coming from the wire given we have supported IPv6-only kernels
for a few years now.
See also draft-itojun-v6ops-v4mapped-harmful.

Suggested by: melifaro
MFC after: 2 weeks


# b745e762 06-Dec-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

ip6_input: remove redundant v4mapped check

In ip6_input() we apply the same v4mapped address check twice. The only
case which skipps the first one is M_FASTFWD_OURS which should have passed
the check on the firstinput pass and passed the firewall.
Remove the 2nd redundant check.

Reviewed by: kp, melifaro
MFC after: 2 weeks
Sponsored by: Netflix (originally)
Differential Revision: https://reviews.freebsd.org/D22462


# a4adf6cc 30-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros.

r354748-354750 replaced the KAME macros with m_pulldown() calls.
Contrary to the rest of the network stack m_len checks before m_pulldown()
were not put in placed (see r354748).
Put these m_len checks in place for now (to go along with the style of the
network stack since the initial commits). These are not put in for
performance but to avoid an error scenario (even though it also will help
performance at the moment as it avoid allocating an extra mbuf; not because
of the unconditional function call).

The observed error case went like this:
(1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it.
(2) m_pullup() will call m_get() unless the requested length is larger than
MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the
requested length of data and pkthdr into the new mbuf.
(3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail.
This was observed with failing auto-configuration as an RA packet of
200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input()
dropped the mbuf.
(Re-)adding the m_len checks before m_pullup() calls avoids this problems
with mbufs using external storage for now.

MFC after: 3 weeks
Sponsored by: Netflix


# a61b5cfb 15-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

netinet6: Remove PULLDOWN_TESTs.

Remove the KAME introduced PULLDOWN_TESTs which did not even
have a compile-time option in sys/conf to turn them on for a
custom kernel build. They made the code a lot harder to read
or more complicated in a few cases.

Convert the IP6_EXTHDR_CHECK() calls into FreeBSD looking code.
Rather than throwing the packet away if it would not fit the
KAME mbuf expectations, convert the macros to m_pullup() calls.
Do not do any extra manual conditional checks upfront as to
whether the m_len would suffice (*), simply let m_pullup() do
its work (incl. an early check).

Remove extra m_pullup() calls where earlier in the function or
the only caller has already done the pullup.

Discussed with: rwatson (*)
Reviewed by: ae
MFC after: 8 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22334


# a8fe77d8 12-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

netinet*: update *mp to pass the proper value back

In ip6_[direct_]input() we are looping over the extension headers
to deal with the next header. We pass a pointer to an mbuf pointer
to the handling functions. In certain cases the mbuf can be updated
there and we need to pass the new one back. That missing in
dest6_input() and route6_input(). In tcp6_input() we should also
update it before we call tcp_input().

In addition to that mark the mbuf NULL all the times when we return
that we are done with handling the packet and no next header should
be checked (IPPROTO_DONE). This will eventually allow us to assert
proper behaviour and catch the above kind of errors more easily,
expecting *mp to always be set.

This change is extracted from a larger patch and not an exhaustive
change across the entire stack yet.

PR: 240135
Reported by: prabhakar.lakhera gmail.com
MFC after: 3 weeks
Sponsored by: Netflix


# 503f4e47 07-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

netinet*: variable cleanup

In preparation for another change factor out various variable cleanups.
These mainly include:
(1) do not assign values to variables during declaration: this makes
the code more readable and does allow for better grouping of
variable declarations,
(2) do not assign values to variables before need; e.g., if a variable
is only used in the 2nd half of a function and we have multiple
return paths before that, then do not set it before it is needed, and
(3) try to avoid assigning the same value multiple times.

MFC after: 3 weeks
Sponsored by: Netflix


# 67a10c46 21-Oct-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

frag6: fix vnet teardown leak

When shutting down a VNET we did not cleanup the fragmentation hashes.
This has multiple problems: (1) leak memory but also (2) leak on the
global counters, which might eventually lead to a problem on a system
starting and stopping a lot of vnets and dealing with a lot of IPv6
fragments that the counters/limits would be exhausted and processing
would no longer take place.

Unfortunately we do not have a useable variable to indicate when
per-VNET initialization of frag6 has happened (or when destroy happened)
so introduce a boolean to flag this. This is needed here as well as
it was in r353635 for ip_reass.c in order to avoid tripping over the
already destroyed locks if interfaces go away after the frag6 destroy.

While splitting things up convert the TRY_LOCK to a LOCK operation in
now frag6_drain_one(). The try-lock was derived from a manual hand-rolled
implementation and carried forward all the time. We no longer can afford
not to get the lock as that would mean we would continue to leak memory.

Assert that all the buckets are empty before destroying to lock to
ensure long-term stability of a clean shutdown.

Reported by: hselasky
Reviewed by: hselasky
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22054


# e7a541b0 19-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

When processing an incoming IPv6 packet over the loopback interface which
contains Hop-by-Hop options, the mbuf chain is potentially changed in
ip6_hopopts_input(), called by ip6_input_hbh().
This can happen, because of the the use of IP6_EXTHDR_CHECK, which might
call m_pullup().
So provide the updated pointer back to the called of ip6_input_hbh() to
avoid using a freed mbuf chain in`ip6_input()`.

Reviewed by: markj@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21664


# 0ecd976e 02-Aug-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

IPv6 cleanup: kernel

Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names. We are down to very few places now that it is feasible
to do the change for everything remaining with causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
- in6pcb which really is an inpcb and nothing separate
- sotoin6pcb which is sotoinpcb (as per above)
- in6p_sp which is inp_sp
- in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone also rename the in6p variables to inp as
that is what we call it in most of the network stack including
parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will less ignore one or the other protocol family when doing
code changes as they may not have spotted places due to different
names for the same thing.

No functional changes.

Discussed with: tuexen (SCTP changes)
MFC after: 3 months
Sponsored by: Netflix


# b252313f 31-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

New pfil(9) KPI together with newborn pfil API and control utility.

The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.

Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision: https://reviews.freebsd.org/D18951


# 62484790 14-Aug-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Restore ability to send ICMP and ICMPv6 redirects.

It was lost when tryforward appeared. Now ip[6]_tryforward will be enabled
only when sending redirects for corresponding IP version is disabled via
sysctl. Otherwise will be used default forwarding function.

PR: 221137
Submitted by: mckay@
MFC after: 2 weeks


# 6040822c 30-Jul-2018 Alan Somers <asomers@FreeBSD.org>

Make timespecadd(3) and friends public

The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.

Our kernel currently defines two-argument versions of timespecadd and
timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions. Solaris also defines a three-argument version, but
only in its kernel. This revision changes our definition to match the
common three-argument version.

Bump _FreeBSD_version due to the breaking KPI change.

Discussed with: cem, jilles, ian, bde
Differential Revision: https://reviews.freebsd.org/D14725


# 4f6c66cc 23-May-2018 Matt Macy <mmacy@FreeBSD.org>

UDP: further performance improvements on tx

Cumulative throughput while running 64
netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15409


# d7c5a620 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

ifnet: Replace if_addr_lock rwlock with epoch + mutex

Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree
4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32
4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32
4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32
4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32
4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32
4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32
4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32

After the patch

InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree
5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51
5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51
5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51
5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51
5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52
5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15366


# effaab88 23-Mar-2018 Kristof Provost <kp@FreeBSD.org>

netpfil: Introduce PFIL_FWD flag

Forwarded packets passed through PFIL_OUT, which made it difficult for
firewalls to figure out if they were forwarding or producing packets. This in
turn is an issue for pf for IPv6 fragment handling: it needs to call
ip6_output() or ip6_forward() to handle the fragments. Figuring out which was
difficult (and until now, incorrect).
Having pfil distinguish the two removes an ugly piece of code from pf.

Introduce a new variant of the netpfil callbacks with a flags variable, which
has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if
a packet is forwarded.

Reviewed by: ae, kevans
Differential Revision: https://reviews.freebsd.org/D13715


# 68e0e5a6 05-Feb-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Modify ip6_get_prevhdr() to be able use it safely.

Instead of returning pointer to the previous header, return its offset.
In frag6_input() use m_copyback() and determined offset to store next
header instead of accessing to it by pointer and assuming that the memory
is contiguous.

In rip6_input() use offset returned by ip6_get_prevhdr() instead of
calculating it from pointers arithmetic, because IP header can belong
to another mbuf in the chain.

Reported by: Maxime Villard <max at m00nbsd dot net>
Reviewed by: kp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14158


# 2164def6 29-Jan-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Do not skip scope zone violation check, when mbuf has M_FASTFWD_OURS flag.

When mbuf has M_FASTFWD_OURS flag, this means that a destination address
is our local, but we still need to pass scope zone violation check,
because protocol level expects that IPv6 link-local addresses have
embedded scope zone indexes. This should fix the problem, when ipfw is
used to forward packets to local address and source address of a packet
is IPv6 LLA.

Reported by: sbruno
MFC after: 3 weeks


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 06193f0b 07-Nov-2017 Konstantin Belousov <kib@FreeBSD.org>

Use hardware timestamps to report packet timestamps for SO_TIMESTAMP
and other similar socket options.

Provide new control message SCM_TIME_INFO to supply information about
timestamp. Currently it indicates that the timestamp was
hardware-assisted and high-precision, for software timestamps the
message is not returned. Reserved fields are added to ABI to report
additional info about it, it is expected that raw hardware clock value
might be useful for some applications.

Reviewed by: gallatin (previous version), hselasky
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D12638


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# fcf59617 06-Feb-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Merge projects/ipsec into head/.

Small summary
-------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352


# 339efd75 16-Jan-2017 Maxim Sobolev <sobomax@FreeBSD.org>

Add a new socket option SO_TS_CLOCK to pick from several different clock
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:

o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).

In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.

Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.

This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.

There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.

Reviewed by: adrian, gnn
Differential Revision: https://reviews.freebsd.org/D9171


# ad9f4d6a 19-Dec-2016 Andrey V. Elsukov <ae@FreeBSD.org>

ip[6]_tryforward does inbound and outbound packet firewall processing.
This can lead to change of mbuf pointer (packet filter could do m_pullup(),
NAT, etc). Also in case of change of destination address, tryforward can
decide that packet should be handled by local system. In this case modified
mbuf can be returned to the ip[6]_input(). To handle this correctly, check
M_FASTFWD_OURS flag after return from ip[6]_tryforward. And if it is present,
update variables that depend from mbuf pointer and skip another inbound
firewall processing.

No objection from: #network
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D8764


# 8a030e9c 12-Dec-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Modify IPv6 statistic accounting in ip6_input().

Add rcvif local variable to keep inbound interface pointer. Count
ifs6_in_discard errors in all "goto bad" cases. Now it will count
errors even if mbuf was freed. Modify all places where m->m_pkthdr.rcvif
is used to use local rcvif variable.

Obtained from: Yandex LLC
MFC after: 1 month


# 5a1842a2 12-Dec-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Add ip6_tryforward() - a run to completion forwarding implementation
for IPv6.

It gets performance benefits from reduced number of checks. It doesn't
copy mbuf to be able send ICMPv6 error message, because it keeps mbuf
unchanged until the moment, when the route decision has been made.
It doesn't do IPsec checks, and when some IPsec security policies present,
ip6_input() uses normal slow path.

Reviewed by: bz, gnn
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D8527


# 3e146575 21-Oct-2016 Michael Tuexen <tuexen@FreeBSD.org>

Make ICMPv6 hard error handling for TCP consistent with the ICMPv4
handling. Ensure that:
* Protocol unreachable errors are handled by indicating ECONNREFUSED
to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Communication prohibited errors are handled by indicating ECONNREFUSED
to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Hop Limited exceeded errors are handled by indicating EHOSTUNREACH
to the TCP user for both IPv4 and IPv6.
For IPv6 the TCP connected was dropped but errno wasn't set.

Reviewed by: gallatin, rrs
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: 7904


# 7aeccebc 15-Jul-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Add net.inet6.ip6.intr_queue_maxlen sysctl. It can be used to
change netisr queue limit for IPv6 at runtime.

Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC


# 89856f7e 21-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by: re (hrs)
Obtained from: projects/vnet
Reviewed by: gnn, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6747


# 4c105402 08-Jun-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Cleanup unneded include "opt_ipfw.h".

It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.


# 484149de 03-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Introduce a per-VNET flag to enable/disable netisr prcessing on that VNET.
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.

Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.

The change should be transparent for non-VIMAGE kernels.

Reviewed by: gnn (, hiren)
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6691


# 3f58662d 01-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

The pr_destroy field does not allow us to run the teardown code in a
specific order. VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by: gnn, tuexen (sctp), jhb (kernel.h)
Obtained from: projects/vnet
MFC after: 2 weeks
X-MFC: do not remove pr_destroy
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6652


# 9901091e 22-Mar-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp4 @180378:

Factor out nd6 and in6_attach initialization to their own files.
Also move destruction into those files though still called from
the central initialization.

Sponsored by: CK Software GmbH
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D5033


# 0ea826e0 22-Jan-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 @180892:

With pr_destroy being gone, call ip6_destroy from an ordered
NET_SYSUNINT. Make ip6_destroy() static as well.

Sponsored by: The FreeBSD Foundation


# 1f12da0e 22-Jan-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Just checkpoint the WIP in order to be able to make the tree update
easier. Note: this is currently not in a usable state as certain
teardown parts are not called and the DOMAIN rework is missing.
More to come soon and find its way to head.

Obtained from: P4 //depot/user/bz/vimage/...
Sponsored by: The FreeBSD Foundation


# ef91a976 25-Nov-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Overhaul if_enc(4) and make it loadable in run-time.

Use hhook(9) framework to achieve ability of loading and unloading
if_enc(4) kernel module. INET and INET6 code on initialization registers
two helper hooks points in the kernel. if_enc(4) module uses these helper
hook points and registers its hooks. IPSEC code uses these hhook points
to call helper hooks implemented in if_enc(4).


# aaa46574 06-Nov-2015 Adrian Chadd <adrian@FreeBSD.org>

[netinet6]: Create a new IPv6 netisr which expects the frames to have been verified.

This is required for fragments and encapsulated data (eg tunneling) to be redistributed
to the RSS bucket based on the eventual IPv6 header and protocol (TCP, UDP, etc) header.

* Add an mbuf tag with the state of IPv6 options parsing before the frame is queued
into the direct dispatch handler;
* Continue processing and complete the frame reception in the correct RSS bucket /
netisr context.

Testing results are in the phabricator review.

Differential Revision: https://reviews.freebsd.org/D3563
Submitted by: Tiwei Bie <btw@mail.ustc.edu.cn>


# 68bb8d62 06-Sep-2015 Adrian Chadd <adrian@FreeBSD.org>

Add support for receiving flowtype, flowid and RSS bucket information as part of recvmsg().

Submitted by: Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision: https://reviews.freebsd.org/D3562


# 0be18915 29-Aug-2015 Adrian Chadd <adrian@FreeBSD.org>

Implement RSS hashing/re-hashing for IPv6 ingress packets.

This mirrors the basic IPv4 implementation - IPv6 packets under RSS
now are checked for a correct RSS hash and if one isn't provided,
it's done in software.

This only handles the initial receive - it doesn't yet handle
reinjecting / rehashing packets after being decapsulated from
various tunneling setups. That'll come in some follow-up work.

For non-RSS users, this is almost a giant no-op.

It does change a couple of ipv6 methods to use const mbuf * instead of
mbuf * but it doesn't have any functional changes.

So, the following now occurs:

* If the NIC doesn't do any RSS hashing, it's all done in software.
Single-queue, non-RSS NICs will now have the RX path distributed
into multiple receive netisr queues.

* If the NIC provides the wrong hash (eg only IPv6 hash when we needed
an IPv6 TCP hash, or IPv6 UDP hash when we expected IPv6 hash)
then the hash is recalculated.

* .. if the hash is recalculated, it'll end up being injected into
the correct netisr queue for v6 processing.

Submitted by: Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision: https://reviews.freebsd.org/D3504


# cc0a3c8c 29-Jul-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Convert in_ifaddr_lock and in6_ifaddr_lock to rmlock.

Both are used to protect access to IP addresses lists and they can be
acquired for reading several times per packet. To reduce lock contention
it is better to use rmlock here.

Reviewed by: gnn (previous version)
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D3149


# 8f1beb88 04-Mar-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Fix deadlock in IPv6 PCB code.

When several threads are trying to send datagram to the same destination,
but fragmentation is disabled and datagram size exceeds link MTU,
ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
sockets wanted to know MTU to this destination. And since all threads
hold PCB lock while sending, taking the lock for each PCB in the
in6_pcbnotify() leads to deadlock.

RFC 3542 p.11.3 suggests notify all application wanted to receive
IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
But it doesn't require this, when we don't receive ICMPv6 message.

Change ip6_notify_pmtu() function to be able use it directly from
ip6_output() to notify only one socket, and to notify all sockets
when ICMPv6 packet too big message received.

PR: 197059
Differential Revision: https://reviews.freebsd.org/D1949
Reviewed by: no objection from #network
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC


# 3e88eb90 08-Nov-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Remove ip6_getdstifaddr() and all functions to work with auxiliary data.
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.

Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.

Sponsored by: Yandex LLC


# 8c3cfe0b 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Hide 'struct rtentry' and all its macro inside new header:
net/route_internal.h
The goal is to make its opaque for all code except route/rtsock and
proto domain _rmx.


# 8f5a8818 07-Aug-2014 Kevin Lo <kevlo@FreeBSD.org>

Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric: D476
Reviewed by: jhb


# 36d55f0f 26-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Unify sa_equal() macro usage.

MFC after: 2 weeks


# 52c57247 17-Apr-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Remove unused variable.

PR: 173521
MFC after: 1 week
Sponsored by: Yandex LLC


# e4c77ca0 13-Feb-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Drop packets to multicast address whose scop field contains the
reserved value 0.

MFC after: 1 week
Sponsored by: Yandex LLC


# 5d6d7e75 07-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

o Revamp API between flowtable and netinet, netinet6.
- ip_output() and ip_output6() simply call flowtable_lookup(),
passing mbuf and address family. That's the only code under
#ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
- Remove hand made pcpu stats, and utilize counter(9).
- Snapshot of statistics is available via 'netstat -rs'.
- All sysctls are moved into net.flowtable namespace, since
spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
- Remove chain of multiple flowtables. We simply have one for
IPv4 and one for IPv6.
- Flowtables are allocated in flowtable.c, symbols are static.
- With proper argument to SYSINIT() we no longer need flowtable_ready.
- Hash salt doesn't need to be per-VNET.
- Removed rudimentary debugging, which use quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 76039bc8 26-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 7caf4ab7 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by: melifaro [1], pluknet[2]
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# ca695e08 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Remove useless check of ia6 against NULL, right after dereferencing it.


# 4d3dfd45 13-Sep-2013 Mikolaj Golub <trociny@FreeBSD.org>

Unregister inet/inet6 pfil hooks on vnet destroy.

Discussed with: andre
Approved by: re (rodrigc)


# 57f60867 25-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


# a786f679 09-Jul-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Migrate structs ip6stat, icmp6stat and rip6stat to PCPU counters.


# eca4d720 16-Apr-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Use IP6S_M2MMAX macro.


# 9cb8d207 09-Apr-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Use IP6STAT_INC/IP6STAT_DEC macros to update ip6 stats.

MFC after: 1 week


# 10e5acc3 15-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Use m_getcl() instead of hand allocating.
- Do not calculate constant length values at run time,
CTASSERT() their sanity.
- Remove superfluous cleaning of mbuf fields after allocation.
- Replace compat macros with function calls.

Sponsored by: Nginx, Inc.


# aa481811 14-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Use m_getcl() instead of hand made allocation.

Sponsored by: Nginx, Inc.


# 68eba526 15-Dec-2012 Andrey V. Elsukov <ae@FreeBSD.org>

In additional to the tailq of IPv6 addresses add the hash table.
For now use 256 buckets and fnv_hash function. Use xor'ed 32-bit
s6_addr32 parts of in6_addr structure as a hash key. Update
in6_localip and in6_is_addr_deprecated to use hash table for fastest
lookup.

Sponsored by: Yandex LLC
Discussed with: dwmalone, glebius, bz


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 73cb2f38 15-Nov-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Reduce the overhead of locking, use IF_AFDATA_RLOCK() when we are doing
simple lookups.

Sponsored by: Yandex LLC
MFC after: 1 week


# ffdbf9da 01-Nov-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by: andre


# c1de64a4 25-Oct-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks


# b36dcb9a 12-Jun-2012 Michael Tuexen <tuexen@FreeBSD.org>

Deliver IPV6_TCLASS, IPV6_HOPLIMIT and IPV6_PKTINFO cmsgs (if
requested) on IPV6 sockets, which have been marked to be not IPV6_V6ONLY,
for each received IPV4 packet.

MFC after: 3 days


# 3df0e439 03-Jun-2012 Maksim Yevmenkin <emax@FreeBSD.org>

Plug reference leak.

Interface routes are refcounted as packets move through the stack,
and there's garbage collection tied to it so that route changes can
safely propagate while traffic is flowing. In our setup, we weren't
changing or deleting any routes, but the refcounting logic in
ip6_input() was wrong and caused a reference leak on every inbound
V6 packet. This eventually caused a 32bit overflow, and the resulting
0 value caused the garbage collection to run on the active route.
That then snowballed into the panic.

Reviewed by: scottl
MFC after: 3 days


# ae145050 24-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 bz_ipv6_fast:

Factor out Hop-By-Hop option processing. It's still not heavily used,
it reduces the footprint of ip6_input() and makes ip6_input() more
readable.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# d3443481 24-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 bz_ipv6_fast:

Hide the ip6aux functions. The only one referenced outside ip6_input.c
is not compiled in yet (__notyet__) in route6.c (r235954). We do have
accessor functions that should be used.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days
X-MFC: KPI?


# dead1956 02-Mar-2012 Hiroki Sato <hrs@FreeBSD.org>

Allow to configure net.inet6.ip6.{accept_rtadv,no_radr} by the loader tunables
as well because they have to be configured before interface initialization for
AF_INET6.


# 81d5d46b 03-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Add multi-FIB IPv6 support to the core network stack supplementing
the original IPv4 implementation from r178888:

- Use RT_DEFAULT_FIB in the IPv4 implementation where noticed.
- Use rt*fib() KPI with explicit RT_DEFAULT_FIB where applicable in
the NFS code.
- Use the new in6_rt* KPI in TCP, gif(4), and the IPv6 network stack
where applicable.
- Split in6_rtqtimo() and in6_mtutimo() as done in IPv4 and equally
prevent multiple initializations of callouts in in6_inithead().
- Use wrapper functions where needed to preserve the current KPI to
ease MFCs. Use BURN_BRIDGES to indicate expected future cleanup.
- Fix (related) comments (both technical or style).
- Convert to rtinit() where applicable and only use custom loops where
currently not possible otherwise.
- Multicast group, most neighbor discovery address actions and faith(4)
are locked to the default FIB. Individual IPv6 addresses will only
appear in the default FIB, however redirect information and prefixes
of connected subnets are automatically propagated to all FIBs by
default (mimicking IPv4 behavior as closely as possible).

Sponsored by: Cisco Systems, Inc.


# 137f91e8 05-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by: bz
MFC after: 2 weeks


# 8a006adb 20-Aug-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Add support for IPv6 to ipfw fwd:
Distinguish IPv4 and IPv6 addresses and optional port numbers in
user space to set the option for the correct protocol family.
Add support in the kernel for carrying the new IPv6 destination
address and port.
Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change
the address in the IP header.
Add support for IPv6 forwarding to a non-local destination.
Add a regession test uitilizing VIMAGE to check all 20 possible
combinations I could think of.

Obtained from: David Dolson at Sandvine Incorporated
(original version for ipfw fwd IPv6 support)
Sponsored by: Sandvine Incorporated
PR: bin/117214
MFC after: 4 weeks
Approved by: re (kib)


# 86905204 08-Jun-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Add the missing call to ip6_ipsec_filtertunnel() to be able to control
whether decapsulated IPsec packets will be passed to pfil again depending
on the setting of the net.ip6.ipsec6.filtertunnel sysctl.

PR: kern/157670
Submitted by: Manuel Kasper (mk neon1.net)
MFC after: 2 weeks


# 6d79f3f6 27-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Fix more continuous/contiguous typos (cf. r215955)


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 1b48d245 02-Sep-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 CH=183052 183053 183258:

In protosw we define pr_protocol as short, while on the wire
it is an uint8_t. That way we can have "internal" protocols
like DIVERT, SEND or gaps for modules (PROTO_SPACER).
Switch ipproto_{un,}register to accept a short protocol number(*)
and do an upfront check for valid boundries. With this we
also consistently report EPROTONOSUPPORT for out of bounds
protocols, as we did for proto == 0. This allows a caller
to not error for this case, which is especially important
if we want to automatically call these from domain handling.

(*) the functions have been without any in-tree consumer
since the initial introducation, so this is considered save.

Implement ip6proto_{un,}register() similarly to their legacy IP
counter parts to allow modules to hook up dynamically.

Reviewed by: philip, will
MFC after: 1 week


# 101235dc 21-Jul-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Since r186119 IP6 input counters for octets and packets were not
working anymore. In addition more checks and operations were missing.

In case lla_lookup results in a match, get the ifaddr to update the
statistics counters, and check that the address is neither tentative,
duplicate or otherwise invalid before accepting the packet. If ok,
record the address information in the mbuf. [ as is done in case
lla_lookup does not return a result and we go through the FIB ].

Reported by: remko
Tested by: remko
MFC after: 2 weeks


# 83e711ec 16-May-2010 Kip Macy <kmacy@FreeBSD.org>

allocate ipv6 flows from the ipv6 flow zone

reported by: rrs@

MFC after: 3 days


# 94162961 13-May-2010 Kip Macy <kmacy@FreeBSD.org>

do a proper fix

Pointed out by: np@

MFC after: 3 days


# fc21c49a 13-May-2010 Kip Macy <kmacy@FreeBSD.org>

fix compile error on some builds by doing the equivalent of
an "extern VNET_DEFINE" without "__used"

MFC after: 3 days


# 69381083 10-May-2010 Kip Macy <kmacy@FreeBSD.org>

boot time size the flowtable

MFC after: 3 days


# 77931dd5 09-May-2010 Kip Macy <kmacy@FreeBSD.org>

Add flowtable support to IPv6

Tested by: qingli@

Reviewed by: qingli@
MFC after: 3 days


# 480d7c6c 06-May-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r207369:
MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH


# 82cea7e6 29-Apr-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 2ae7ec29 07-Feb-2010 Julian Elischer <julian@FreeBSD.org>

MFC of 197952 and 198075

Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.
and
Unbreak the VIMAGE build with IPSEC, broken with r197952 by
virtualizing the pfil hooks.
For consistency add the V_ to virtualize the pfil hooks in here as well.


# 3745cc73 08-Jan-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace several instances of 'if (!a & b)' with 'if (!(a &b))' in order
to silence newer GCC versions.


# 0b4b0b0f 10-Oct-2009 Julian Elischer <julian@FreeBSD.org>

Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.

Sitting aroung in tree waiting to commit for: 2 months
MFC after: 2 months


# a283298c 12-Sep-2009 Hiroki Sato <hrs@FreeBSD.org>

Improve flexibility of receiving Router Advertisement and
automatic link-local address configuration:

- Convert a sysctl net.inet6.ip6.accept_rtadv to one for the
default value of a per-IF flag ND6_IFF_ACCEPT_RTADV, not a
global knob. The default value of the sysctl is 0.

- Add a new per-IF flag ND6_IFF_AUTO_LINKLOCAL and convert a
sysctl net.inet6.ip6.auto_linklocal to one for its default
value. The default value of the sysctl is 1.

- Make ND6_IFF_IFDISABLED more robust. It can be used to disable
IPv6 functionality of an interface now.

- Receiving RA is allowed if ip6_forwarding==0 *and*
ND6_IFF_ACCEPT_RTADV is set on that interface. The former
condition will be revisited later to support a "host + router" box
like IPv6 CPE router. The current behavior is compatible with
the older releases of FreeBSD.

- The ifconfig(8) now supports these ND6 flags as well as "nud",
"prefer_source", and "disabled" in ndp(8). The ndp(8) now
supports "auto_linklocal".

Discussed with: bz and jinmei
Reviewed by: bz
MFC after: 3 days


# 4090e9b2 30-Aug-2009 Qing Li <qingli@FreeBSD.org>

MFC r196569

When multiple interfaces exist in the system, with each interface having
an IPv6 address assigned to it, and if an incoming packet received on
one interface has a packet destination address that belongs to another
interface, the routing table is consulted to determine how to reach this
packet destination. Since the packet destination is an interface address,
the route table will return a host route with the loopback interface as
rt_ifp. The input code must recognize this fact, instead of using the
loopback interface, the input code performs a search to find the right
interface that owns the given IPv6 address.

Reviewed by: bz, gnn, kmacy
Approved by: re


# 7bcee7f3 26-Aug-2009 Qing Li <qingli@FreeBSD.org>

When multiple interfaces exist in the system, with each interface having
an IPv6 address assigned to it, and if an incoming packet received on
one interface has a packet destination address that belongs to another
interface, the routing table is consulted to determine how to reach this
packet destination. Since the packet destination is an interface address,
the route table will return a host route with the loopback interface as
rt_ifp. The input code must recognize this fact, instead of using the
loopback interface, the input code performs a search to find the right
interface that owns the given IPv6 address.

Reviewed by: bz, gnn, kmacy
MFC after: immediately


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 0a4747d4 20-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Garbage collect vnet module registrations that have neither constructors
nor destructors, as there's no actual work to do.

In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.

Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.

Reviewed by: bz
Approved by: re (vimage blanket)


# 1e77c105 16-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# d1da0a06 25-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Add address list locking for in6_ifaddrhead/ia_link: as with locking
for in_ifaddrhead, we stick with an rwlock for the time being, which
we will revisit in the future with a possible move to rmlocks.

Some pieces of code require significant further reworking to be
safe from all classes of writer-writer races.

Reviewed by: bz
MFC after: 6 weeks


# 80af0152 24-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Convert netinet6 to using queue(9) rather than hand-crafted linked lists
for the global IPv6 address list (in6_ifaddr -> in6_ifaddrhead). Adopt
the code styles and conventions present in netinet where possible.

Reviewed by: gnn, bz
MFC after: 6 weeks (possibly not MFCable?)


# 8c0fec80 23-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:

ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)


# 8d8bc018 08-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.


# bc29160d 08-Jun-2009 Marko Zec <zec@FreeBSD.org>

Introduce an infrastructure for dismantling vnet instances.

Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)


# d825c793 01-Jun-2009 Marko Zec <zec@FreeBSD.org>

V_loif is not an array but a pure pointer, so treat it as such.

Reviewed by: bz
Approved by: julian (mentor)


# d4b5cae4 01-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processings its own
workstream, or set of per-protocol queues. Threads may be bound
to specific CPUs, or allowed to migrate, based on a global policy.

In the future it would be desirable to support topology-centric
policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
currently be one of:

NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
an implicit or explicit source (such as an interface or socket).

NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
as well as allowing protocols to provide a flow generation function
for mbufs without flow identifers (m2flow). Falls back on
NETISR_POLICY_SOURCE if now flow ID is available.

NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
being used, as well as a mapping function from workstream to CPU ID,
which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
get and clear drop counters, which query data or apply changes across
all workstreams.

- Add a more extensible netisr registration interface, in which
protocols declare 'struct netisr_handler' structures for each
registered NETISR_ type. These include name, handler function,
optional mbuf to flow ID function, optional mbuf to CPU ID function,
queue limit, and ordering policy. Padding is present to allow these
to be expanded in the future. If no queue limit is declared, then
a default is used.

- Queue limits are now per-workstream, and raised from the previous
IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
with the exception of netnatm, default queue limits. Most protocols
register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
NETISR_POLICY_FLOW, and will therefore take advantage of driver-
generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
the netisr, rather than having polling pretend to be two protocols.
Provide two explicit hooks in the netisr worker for start and end
events for runs: netisr_poll() and netisr_pollmore(), as well as a
function, netisr_sched_poll(), to allow the polling code to schedule
netisr execution. DEVICE_POLLING still embeds single-netisr
assumptions in its implementation, so for now if it is compiled into
the kernel, a single and un-bound netisr thread is enforced
regardless of tunable configuration.

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking. It will be enabled once further rmlock
optimization has taken place. However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by: bz


# 29dc7bc6 27-May-2009 Bruce M Simpson <bms@FreeBSD.org>

Merge final round of MLD changes from p4:
ip6_input.c, in6.h:
* Add netinet6-specific mbuf flag M_RTALERT_MLD, shadowing M_PROTO6.
* Always set this flag if HBH Router Alert option is present for MLD,
even when not forwarding.

icmp6.c:
* In icmp6_input(), spell m->m_pkthdr.rcvif as ifp to be consistent.
* Use scope ID for verifying input. Do not apply SSM filters here, no inpcb.
* Check for M_RTALERT_MLD when validating MLD traffic, as we can't see
IPv6 hop options outside of ip6_input().

in6_mcast.c:
* Use KAME scope/zone ID in in6_multi.
* Update net.inet6.ip6.mcast.filters implementation to use scope IDs
for comparisons.
* Fix scope ID treatment in multicast socket option processing.
Scope IDs passed in from userland will be ignored as other less
ambiguous APIs exist for specifying the link.
* Tighten userland input checks in IPv6 SSM delta and full-state ops.
* Source filter embedded scope IDs need to be revisited, for now
just clear them and ignore them on input.
* Adapt KAME behaviour of looking up the scope ID in the default zone
for multicast leaves, when the interface is ambiguous.

mld6.c:
* Tighten origin checks on MLD traffic as per RFC3810 Section 6.2:
* ip6_src MAY be the unspecified address for MLDv1 reports.
* ip6_src MAY have link-local address scope for MLDv1 reports,
MLDv1 queries, and MLDv2 queries.
* Perform address field validation *before* accepting queries.
* Use KAME scope/zone ID in query/report processing.
* Break const correctness for mld_v1_input_report(), mld_v1_input_query()
as we temporarily modify the input mbuf chain.
* Clear the scope ID before handoff to userland MLD daemon.
* Fix MLDv1 old querier present timer processing.
With the protocol defaults, hosts should revert to MLDv2 after 260s.
* Add net.inet6.mld.v1enable sysctl, default to on.

ifmcstat.c:
* Use sysctl by default; -K requests kvm(3) if so compiled.

mld.4:
* Connect man page to build.

Tested using PCS.


# f6dfe47a 30-Apr-2009 Marko Zec <zec@FreeBSD.org>

Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)


# 33cde130 29-Apr-2009 Bruce M Simpson <bms@FreeBSD.org>

Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev. Summary of changes:

* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.

NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.

This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.


# 3f795dd3 23-Apr-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Compare protosw pointer with NULL.

MFC after: 1 month


# bfe1aba4 10-Apr-2009 Marko Zec <zec@FreeBSD.org>

Introduce vnet module registration / initialization framework with
dependency tracking and ordering enforcement.

With this change, per-vnet initialization functions introduced with
r190787 are no longer directly called from traditional initialization
functions (which cc in most cases inlined to pre-r190787 code), but are
instead registered via the vnet framework first, and are invoked only
after all prerequisite modules have been initialized. In the long run,
this framework should allow us to both initialize and dismantle
multiple vnet instances in a correct order.

The problem this change aims to solve is how to replay the
initialization sequence of various network stack components, which
have been traditionally triggered via different mechanisms (SYSINIT,
protosw). Note that this initialization sequence was and still can be
subtly different depending on whether certain pieces of code have been
statically compiled into the kernel, loaded as modules by boot
loader, or kldloaded at run time.

The approach is simple - we record the initialization sequence
established by the traditional mechanisms whenever vnet_mod_register()
is called for a particular vnet module. The vnet_mod_register_multi()
variant allows a single initializer function to be registered multiple
times but with different arguments - currently this is only used in
kern/uipc_domain.c by net_add_domain() with different struct domain *
as arguments, which allows for protosw-registered initialization
routines to be invoked in a correct order by the new vnet
initialization framework.

For the purpose of identifying vnet modules, each vnet module has to
have a unique ID, which is statically assigned in sys/vimage.h.
Dynamic assignment of vnet module IDs is not supported yet.

A vnet module may specify a single prerequisite module at registration
time by filling in the vmi_dependson field of its vnet_modinfo struct
with the ID of the module it depends on. Unless specified otherwise,
all vnet modules depend on VNET_MOD_NET (container for ifnet list head,
rt_tables etc.), which thus has to and will always be initialized
first. The framework will panic if it detects any unresolved
dependencies before completing system initialization. Detection of
unresolved dependencies for vnet modules registered after boot
(kldloaded modules) is not provided.

Note that the fact that each module can specify only a single
prerequisite may become problematic in the long run. In particular,
INET6 depends on INET being already instantiated, due to TCP / UDP
structures residing in INET container. IPSEC also depends on INET,
which will in turn additionally complicate making INET6-only kernel
configs a reality.

The entire registration framework can be compiled out by turning on the
VIMAGE_GLOBALS kernel config option.

Reviewed by: bz
Approved by: julian (mentor)


# 1ed81b73 06-Apr-2009 Marko Zec <zec@FreeBSD.org>

First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

At this stage, the new per-vnet initializer functions are
directly called from the existing global initialization code,
which should in most cases result in compiler inlining those
new functions, hence yielding a near-zero functional change.

Modify the existing initializer functions which are invoked via
protosw, like ip_init() et. al., to allow them to be invoked
multiple times, i.e. per each vnet. Global state, if any,
is initialized only if such functions are called within the
context of vnet0, which will be determined via the
IS_DEFAULT_VNET(curvnet) check (currently always true).

While here, V_irtualize a few remaining global UMA zones
used by net/netinet/netipsec networking code. While it is
not yet clear to me or anybody else whether this is the right
thing to do, at this stage this makes the code more readable,
and makes it easier to track uncollected UMA-zone-backed
objects on vnet removal. In the long run, it's quite possible
that some form of shared use of UMA zone pools among multiple
vnets should be considered.

Bump __FreeBSD_version due to changes in layout of structs
vnet_ipfw, vnet_inet and vnet_net.

Approved by: julian (mentor)


# 33553d6e 27-Feb-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.


# 09f8c3ff 01-Feb-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove the single global unlocked route cache ip6_forward_rt
from the inet6 stack along with statistics and make sure we
properly free the rt in all cases.

While the current situation is not better performance wise it
prevents panics seen more often these days.
After more inet6 and ipsec cleanup we should be able to improve
the situation again passing the rt to ip6_forward directly.

Leave the ip6_forward_rt entry in struct vinet6 but mark it
for removal.

PR: kern/128247, kern/131038
MFC after: 25 days
Committed from: Bugathon #6
Tested by: Denis Ahrens <denis@h3q.com> (different initial version)


# 351c4745 30-Jan-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove 4 entirely unsued ip6 variables.
Leave then in struct vinet6 to not break the ABI with kernel modules
but mark them for removal so we can do it in one batch when the time
is right.

MFC after: 1 month


# f5d35259 21-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Correct variable name in comment.

MFC after: 4 weeks


# 099d0bd3 18-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Only unlock the llentry if it is actually valid.

Reported by: ed


# fc384fa5 15-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().

Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.

Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)


# 6e6b3f7c 14-Dec-2008 Qing Li <qingli@FreeBSD.org>

This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion


# 1b193af6 13-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Second round of putting global variables, which were virtualized
but formerly missed under VIMAGE_GLOBAL.

Put the extern declarations of the virtualized globals
under VIMAGE_GLOBAL as the globals themsevles are already.
This will help by the time when we are going to remove the globals
entirely.

Sponsored by: The FreeBSD Foundation


# 385195c0 10-Dec-2008 Marko Zec <zec@FreeBSD.org>

Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 44e33a07 19-Nov-2008 Marko Zec <zec@FreeBSD.org>

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 48d48eb9 16-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix a regression introduced in r179289 splitting up ip6_savecontrol()
into v4-only vs. v6-only inp_flags processing.
When ip6_savecontrol_v4() is called from ip6_savecontrol() we
were not passing back the **mp thus the information will be missing
in userland.
Istead of going with a *** as suggested in the PR we are returning
**mp now and passing in the v4only flag as a pointer argument.

PR: kern/126349
Reviewed by: rwatson, dwmalone


# 59dd72d0 03-Jul-2008 Robert Watson <rwatson@FreeBSD.org>

Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred). Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.

Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired. This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acqusition of Giant and any calling
context.

It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.

Reviewed by: bz
MFC after: 1 month
X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
but the rest can go back.


# aaa37a7e 03-Jul-2008 Robert Watson <rwatson@FreeBSD.org>

Remove GIANT_REQUIRED from IPv6 input, forward, and frag6 code. The frag6
code is believed to be MPSAFE, and leaving aside the IPv6 route cache in
forwarding, Giant appears not to adequately synchronize the data structures
in the input or forwarding paths.


# 0a2fe173 02-Jul-2008 Robert Watson <rwatson@FreeBSD.org>

Set the IPv6 netisr handler as NETISR_MPSAFE on the basis that, despite
there still being some well-known races in mld6 and nd6, running with
Giant over the netisr handler provides little or not additional
synchronization that might cause mld6 and nd6 to behave better.


# 9a38ba81 24-May-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Factor out the v4-only vs. the v6-only inp_flags processing in
ip6_savecontrol in preparation for udp_append() to no longer
need an WLOCK as we will no longer be modifying socket options.

Requested by: rwatson
Reviewed by: gnn
MFC after: 10 days


# 9233d8f3 08-Jan-2008 David E. O'Brien <obrien@FreeBSD.org>

un-__P()


# b48287a3 10-Dec-2007 David E. O'Brien <obrien@FreeBSD.org>

Clean up VCS Ids.


# 2a463222 05-Jul-2007 Xin LI <delphij@FreeBSD.org>

Space cleanup

Approved by: re (rwatson)


# 1272577e 05-Jul-2007 Xin LI <delphij@FreeBSD.org>

ANSIfy[1] plus some style cleanup nearby.

Discussed with: gnn, rwatson
Submitted by: Karl Sj?dahl - dunceor <dunceor gmail com> [1]
Approved by: re (rwatson)


# b2630c29 02-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


# 2cb64cb2 01-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


# 7eefde2c 14-May-2007 JINMEI Tatuya <jinmei@FreeBSD.org>

handle IPv6 router alert option contained in an incoming packet per
option value so that unrecognized options are ignored as specified in RFC2711.
(packets containing an MLD router alert option are passed to the upper layer
as before).

Approved by: gnn (mentor), ume (mentor)


# 6be2e366 24-Feb-2007 Bruce M Simpson <bms@FreeBSD.org>

Make IPv6 multicast forwarding dynamically loadable from a GENERIC kernel.
It is built in the same module as IPv4 multicast forwarding, i.e. ip_mroute.ko,
if and only if IPv6 support is enabled for loadable modules.
Export IPv6 forwarding structs to userland netstat(1) via sysctl(9).


# 1d54aa3b 11-Dec-2006 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4: 92972, 98913 + one more change

In ip6_sprintf no longer use and return one of eight static buffers
for printing/logging ipv6 addresses.
The caller now has to hand in a sufficiently large buffer as first
argument.


# 43bc7a9c 04-Aug-2006 Brooks Davis <brooks@FreeBSD.org>

With exception of the if_name() macro, all definitions in net_osdep.h
were unused or already in if_var.h so add if_name() to if_var.h and
remove net_osdep.h along with all references to it.

Longer term we may want to kill off if_name() entierly since all modern
BSDs have if_xname variables rendering it unnecessicary.


# 656faadc 12-May-2006 Max Laier <mlaier@FreeBSD.org>

Remove ip6fw. Since ipfw has full functional IPv6 support now and - in
contrast to ip6fw - is properly lockes, it is time to retire ip6fw.


# 604afec4 01-Feb-2006 Christian S.J. Peron <csjp@FreeBSD.org>

Somewhat re-factor the read/write locking mechanism associated with the packet
filtering mechanisms to use the new rwlock(9) locking API:

- Drop the variables stored in the phil_head structure which were specific to
conditions and the home rolled read/write locking mechanism.
- Drop some includes which were used for condition variables
- Drop the inline functions, and convert them to macros. Also, move these
macros into pfil.h
- Move pfil list locking macros intp phil.h as well
- Rename ph_busy_count to ph_nhooks. This variable will represent the number
of IN/OUT hooks registered with the pfil head structure
- Define PFIL_HOOKED macro which evaluates to true if there are any
hooks to be ran by pfil_run_hooks
- In the IP/IP6 stacks, change the ph_busy_count comparison to use the new
PFIL_HOOKED macro.
- Drop optimization in pfil_run_hooks which checks to see if there are any
hooks to be ran, and returns if not. This check is already performed by the
IP stacks when they call:

if (!PFIL_HOOKED(ph))
goto skip_hooks;

- Drop in assertion which makes sure that the number of hooks never drops
below 0 for good measure. This in theory should never happen, and if it
does than there are problems somewhere
- Drop special logic around PFIL_WAITOK because rw_wlock(9) does not sleep
- Drop variables which support home rolled read/write locking mechanism from
the IPFW firewall chain structure.
- Swap out the read/write firewall chain lock internal to use the rwlock(9)
API instead of our home rolled version
- Convert the inlined functions to macros

Reviewed by: mlaier, andre, glebius
Thanks to: jhb for the new locking API


# 411babc6 25-Jan-2006 Hajimu UMEMOTO <ume@FreeBSD.org>

don't embed scope id before running packet filters.

Reported by: YAMAMOTO Takashi <yamt__at__mwd.biglobe.ne.jp>
Obtained from: NetBSD
MFC after: 1 week


# 5b27b045 19-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

supported an ndp command suboption to disable IPv6 in the given interface

Obtained from: KAME
Reviewd by: ume, gnn
MFC after: 2 week


# a1f7e5f8 24-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

scope cleanup. with this change
- most of the kernel code will not care about the actual encoding of
scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
scoped addresses as a special case.
- scope boundary check will be stricter. For example, the current
*BSD code allows a packet with src=::1 and dst=(some global IPv6
address) to be sent outside of the node, if the application do:
s = socket(AF_INET6);
bind(s, "::1");
sendto(s, some_global_IPv6_addr);
This is clearly wrong, since ::1 is only meaningful within a single
node, but the current implementation of the *BSD kernel cannot
reject this attempt.

Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from: KAME


# 18b35df8 20-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

update comments:
- RFC2292bis -> RFC3542
- typo fixes

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from: KAME


# 6c011e4d 15-Mar-2005 Sam Leffler <sam@FreeBSD.org>

correct bounds check

Noticed by: Coverity Prevent analysis tool


# caf43b02 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes, separate for KAME


# f45cd79a 19-Oct-2004 Andre Oppermann <andre@FreeBSD.org>

Be more careful to only index valid IP protocols and be more verbose with
comments.


# d6a8d588 28-Sep-2004 Max Laier <mlaier@FreeBSD.org>

Add an additional struct inpcb * argument to pfil(9) in order to enable
passing along socket information. This is required to work around a LOR with
the socket code which results in an easy reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a seperate (later) commit.

This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
forseeable future.

Suggested by: rwatson
A lot of work by: csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by: rwatson, csjp
Tested by: -pf, -ipfw, LINT, csjp and myself
MFC after: 3 days

LOR IDs: 14 - 17 (not fixed yet)


# c21fd232 27-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Always compile PFIL_HOOKS into the kernel and remove the associated kernel
compile option. All FreeBSD packet filters now use the PFIL_HOOKS API and
thus it becomes a standard part of the network stack.

If no hooks are connected the entire packet filter hooks section and related
activities are jumped over. This removes any performance impact if no hooks
are active.

Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.


# c415679d 22-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Remove in6_prefix.[ch] and the contained router renumbering capability.
The prefix management code currently resides in nd6, leaving only the
unused router renumbering capability in the in6_prefix files. Removing
it will make it easier for us to provide locking for the remainder of
IPv6 by reducing the number of objects requiring synchronized access.

This functionality has also been removed from NetBSD and OpenBSD.

Submitted by: George Neville-Neil <gnn at neville-neil.com>
Discussed with/approved by: suz, keiichi at kame.net, core at kame.net


# 1f44b0a1 14-Aug-2004 David Malone <dwmalone@FreeBSD.org>

Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD
have already done this, so I have styled the patch on their work:

1) introduce a ip_newid() static inline function that checks
the sysctl and then decides if it should return a sequential
or random IP ID.

2) named the sysctl net.inet.ip.random_id

3) IPv6 flow IDs and fragment IDs are now always random.
Flow IDs and frag IDs are significantly less common in the
IPv6 world (ie. rarely generated per-packet), so there should
be smaller performance concerns.

The sysctl defaults to 0 (sequential IP IDs).

Reviewed by: andre, silby, mlaier, ume
Based on: NetBSD
MFC after: 2 months


# 02b199f1 13-Jun-2004 Max Laier <mlaier@FreeBSD.org>

Link ALTQ to the build and break with ABI for struct ifnet. Please recompile
your (network) modules as well as any userland that might make sense of
sizeof(struct ifnet).
This does not change the queueing yet. These changes will follow in a
seperate commit. Same with the driver changes, which need case by case
evaluation.

__FreeBSD_version bump will follow.

Tested-by: (i386)LINT


# 3c751c1b 02-Jun-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

do not check super user privilege in ip6_savecontrol. It is
meaningless and can even be harmful.

Obtained from: KAME
MFC after: 3 days


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 43eb694a 02-Mar-2004 Max Laier <mlaier@FreeBSD.org>

Move PFIL_HOOKS and ipfw past the scope checks to allow easy redirection to
linklocal.

Obtained from: OpenBSD
Reviewed by: ume
Approved by: bms(mentor)


# 48850f29 02-Mar-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

scope awareness of ff01:: is not merged, yet. So, clear
embeded form of scopeid for ff01:: for now.

Pointed out by: mlaier


# cfcea119 01-Mar-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

- reject incoming packets to an interface-local multicast address from
the wire.
- added a generic scope check, and removed checks for loopback src/dst
addresses.

Obtained from: KAME


# efddf5c6 13-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

supported IPV6_RECVPATHMTU socket option.

Obtained from: KAME


# 26d02ca7 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Remove RTF_PRCLONING from routing table and adjust users of it
accordingly. The define is left intact for ABI compatibility
with userland.

This is a pre-step for the introduction of tcp_hostcache. The
network stack remains fully useable with this change.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 7902224c 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

o add a flags parameter to netisr_register that is used to specify
whether or not the isr needs to hold Giant when running; Giant-less
operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive

Supported by: FreeBSD Foundation


# 8d996f28 31-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

initialize in6_tmpaddrtimer_ch.

Obtained from: KAME


# 7fc91b3f 30-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

add management part of address selection policy described in
RFC3484.

Obtained from: KAME


# 11de19f4 28-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

ip6_savecontrol() argument is redundant


# 1410779a 28-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

hide m_tag, again.

Requested by: sam


# b2667576 28-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

make sure to accept only IPv6 packet.

Obtained from: KAME


# 2a5aafce 28-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

cleanup use of m_tag.

Obtained from: KAME


# f95d4633 24-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

Switch Advanced Sockets API for IPv6 from RFC2292 to RFC3542
(aka RFC2292bis). Though I believe this commit doesn't break
backward compatibility againt existing binaries, it breaks
backward compatibility of API.
Now, the applications which use Advanced Sockets API such as
telnet, ping6, mld6query and traceroute6 use RFC3542 API.

Obtained from: KAME


# 9a4f9608 21-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- change scope to zone.
- change node-local to interface-local.
- better error handling of address-to-scope mapping.
- use in6_clearscope().

Obtained from: KAME


# 31b1bfe1 17-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- add dom_if{attach,detach} framework.
- transition to use ifp->if_afdata.

Obtained from: KAME


# e3124327 16-Oct-2003 Sam Leffler <sam@FreeBSD.org>

fix horribly botched MFp4 merge


# 3c92002f 16-Oct-2003 Sam Leffler <sam@FreeBSD.org>

pfil hooks can modify packet contents so check if the destination
address has been changed when PFIL_HOOKS is enabled and, if it has,
arrange for the proper action by ip*_forward.

Submitted by: Pyun YongHyeon
Supported by: FreeBSD Foundation


# 020a816f 10-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

fixed an endian bug on fragment header scanning

Obtained from: KAME


# 953ad2fb 10-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

nuke SCOPEDROUTING. Though it was there for a long time,
it was never enabled.


# 7efe5d92 08-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- fix typo in comments.
- style.
- NULL is not 0.
- some variables were renamed.
- nuke unused logic.
(there is no functional change.)

Obtained from: KAME


# 40e39bbb 06-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

return(code) -> return (code)
(reduce diffs against KAME)


# b79274ba 01-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

randomize IPv6 flowlabel when RANDOM_IP_ID is defined.

Obtained from: KAME


# 18193b6f 01-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

use arc4random()


# 134ea224 23-Sep-2003 Sam Leffler <sam@FreeBSD.org>

o update PFIL_HOOKS support to current API used by netbsd
o revamp IPv4+IPv6+bridge usage to match API changes
o remove pfil_head instances from protosw entries (no longer used)
o add locking
o bump FreeBSD version for 3rd party modules

Heavy lifting by: "Max Laier" <max@love2party.net>
Supported by: FreeBSD Foundation
Obtained from: NetBSD (bits of pfil.h and pfil.c)


# 9adc8e4d 11-Mar-2003 Sam Leffler <sam@FreeBSD.org>

correct malloc flag argument

Reported by: Kris Kennaway <kris@obsecurity.org>


# 1cafed39 04-Mar-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Update netisr handling; Each SWI now registers its queue, and all queue
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.

Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 53d96a08 06-Jan-2003 Sam Leffler <sam@FreeBSD.org>

don't reference a pkthdr after M_MOVE_PKTHDR has "remove it"; instead
reference the pkthdr now in the destination of the move

Sponsored by: Vernier Networks


# 9967cafc 30-Dec-2002 Sam Leffler <sam@FreeBSD.org>

Correct mbuf packet header propagation. Previously, packet headers
were sometimes propagated using M_COPY_PKTHDR which actually did
something between a "move" and a "copy" operation. This is replaced
by M_MOVE_PKTHDR (which copies the pkthdr contents and "removes" it
from the source mbuf) and m_dup_pkthdr which copies the packet
header contents including any m_tag chain. This corrects numerous
problems whereby mbuf tags could be lost during packet manipulations.

These changes also introduce arguments to m_tag_copy and m_tag_copy_chain
to specify if the tag copy work should potentially block. This
introduces an incompatibility with openbsd which we may want to revisit.

Note that move/dup of packet headers does not handle target mbufs
that have a cluster bound to them. We may want to support this;
for now we watch for it with an assert.

Finally, M_COPYFLAGS was updated to include M_FIRSTFRAG|M_LASTFRAG.

Supported by: Vernier Networks
Reviewed by: Robert Watson <rwatson@FreeBSD.org>


# 86fea6be 19-Dec-2002 Bosko Milekic <bmilekic@FreeBSD.org>

o Untangle the confusion with the malloc flags {M_WAITOK, M_NOWAIT} and
the mbuf allocator flags {M_TRYWAIT, M_DONTWAIT}.
o Fix a bpf_compat issue where malloc() was defined to just call
bpf_alloc() and pass the 'canwait' flag(s) along. It's been changed
to call bpf_alloc() but pass the corresponding M_TRYWAIT or M_DONTWAIT
flag (and only one of those two).

Submitted by: Hiten Pandya <hiten@unixdaemons.com> (hiten->commit_count++)


# b9234faf 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 5d846453 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# 0776834a 31-May-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

__FreeBSD__ is not a compiler constant. We must use
__FreeBSD_version here.

Submitted by: rwatson


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 88ff5695 18-Apr-2002 SUZUKI Shinsuke <suz@FreeBSD.org>

just merged cosmetic changes from KAME to ease sync between KAME and FreeBSD.
(based on freebsd4-snap-20020128)

Reviewed by: ume
MFC after: 1 week


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 72b1d826 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove duplicate extern declarations to silence warnings.


# bedbd47e 08-Jan-2002 Mike Smith <msmith@FreeBSD.org>

Initialise the intrq_present fields at runtime, not link time. This allows
us to load protocols at runtime, and avoids the use of common variables.

Also fix the ip6_intrq assignment so that it works at all.


# 9494d596 25-Sep-2001 Brooks Davis <brooks@FreeBSD.org>

Make faith loadable, unloadable, and clonable.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 53dab5fe 02-Jul-2001 Brooks Davis <brooks@FreeBSD.org>

gif(4) and stf(4) modernization:

- Remove gif dependencies from stf.
- Make gif and stf into modules
- Make gif cloneable.

PR: kern/27983
Reviewed by: ru, ume
Obtained from: NetBSD
MFC after: 1 week


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 8d672524 22-May-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

M_COPY_PKTHDR has to be done before MCLGET.

Obtained from: KAME


# d23d3055 20-Apr-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Fix typo in previous commit.

Submitted by: JINMEI Tatuya <jinmei@isl.rdc.toshiba.co.jp>


# 8d642984 19-Apr-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

- Fix to receive icmp6 echo reply within the host itself to ff02::1.
- Fix to receive icmp6 echo reply to link-local of itself.

Reported by: Eriya Akasaka <eakasaka@rodfbs.org>
Submitted by: JINMEI Tatuya <jinmei@isl.rdc.toshiba.co.jp>


# 7b35f61a 05-Apr-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

- correct logic of per-address input packet counts for lo0
- reject packets to fe80::xxxx%lo0 (xxxx != 1)

Submitted by: JINMEI Tatuya <jinmei@isl.rdc.toshiba.co.jp>


# 9ec54137 28-Mar-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Make per-address input packet counts for lo0 work.

Reported by: bmah
Submitted by: Noriyasu KATO <noriyasu.kato@toshiba.co.jp> (via itojun)


# f13bb832 16-Mar-2001 Jun Kuriyama <kuriyama@FreeBSD.org>

Merge from kame (1.175 -> 1.176):
cope with freebsd4 bridge code.


# 0634b4a7 29-Jan-2001 Peter Wemm <peter@FreeBSD.org>

Yikes, these files bogusly #include "loop.h" but didn't use the value.
My searching for NLOOP missed them. :-(


# df5e1987 25-Nov-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Lock down the network interface queues. The queue mutex must be obtained
before adding/removing packets from the queue. Also, the if_obytes and
if_omcasts fields should only be manipulated under protection of the mutex.

IF_ENQUEUE, IF_PREPEND, and IF_DEQUEUE perform all necessary locking on
the queue. An IF_LOCK macro is provided, as well as the old (mutex-less)
versions of the macros in the form _IF_ENQUEUE, _IF_QFULL, for code which
needs them, but their use is discouraged.

Two new macros are introduced: IF_DRAIN() to drain a queue, and IF_HANDOFF,
which takes care of locking/enqueue, and also statistics updating/start
if necessary.


# cf9fa8e7 29-Oct-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move suser() and suser_xxx() prototypes and a related #define from
<sys/proc.h> to <sys/systm.h>.

Correctly document the #includes needed in the manpage.

Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.


# 5da9f8fa 19-Oct-2000 Josef Karthauser <joe@FreeBSD.org>

Augment the 'ifaddr' structure with a 'struct if_data' to keep
statistics on a per network address basis.

Teach the IPv4 and IPv6 input/output routines to log packets/bytes
against the network address connected to the flow.

Teach netstat to display the per-address stats for IP protocols
when 'netstat -i' is evoked, instead of displaying the per-interface
stats.


# 20cb9f9e 23-Sep-2000 Hajimu UMEMOTO <ume@FreeBSD.org>

Make ip6fw as loadable module.


# 1469c434 12-Aug-2000 Hajimu UMEMOTO <ume@FreeBSD.org>

Make compilable with -DIPFILTER.
Because I don't use ipfilter at all, this is not tested.
I don't know if ipfilter is work for IPv6.

Submitted by: yoshiaki@kt.rim.or.jp


# c4ac87ea 31-Jul-2000 Darren Reed <darrenr@FreeBSD.org>

activate pfil_hooks and covert ipfilter to use it


# 686cdd19 04-Jul-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 3389ae93 19-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove ~25 unneeded #include <sys/conf.h>
Remove ~60 unneeded #include <sys/malloc.h>


# 242c5536 12-Feb-2000 Peter Wemm <peter@FreeBSD.org>

Clean up some loose ends in the network code, including the X.25 and ISO
#ifdefs. Clean out unused netisr's and leftover netisr linker set gunk.
Tested on x86 and alpha, including world.

Approved by: jkh


# 210d0432 29-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Add ip6fw.
Yes it is almost code freeze, but as the result of many thought, now I
think this should be added before 4.0...

make world check, kernel build check is done.

Reviewed by: green
Obtained from: KAME project


# 67e394e0 28-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Backout diffs which should not be included.


# 577a30ee 27-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

#This is a null commit to give correct description for the previous change.
#Please forget the strange log message of the previous commit .

IPv6 multicast routing.
kernel IPv6 multicast routing support.
pim6 dense mode daemon
pim6 sparse mode daemon
netstat support of IPv6 multicast routing statistics

Merging to the current and testing with other existing multicast routers
is done by Tatsuya Jinmei <jinmei@kame.net>, who writes and maintainances
the base code in KAME distribution.

Make world check and kernel build check was also successful.

Obtained from: KAME project


# 91ec0a1e 27-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Sorry I didn't commit these files at the commit just a few minutes before.
(IPv6 multicast routing)
I think I mistakenly touched TAB and the last arg sys/netinet6 to
the cvs commit changed to sys/netinet6/in6_proto.c.


# 367d34f8 24-Jan-2000 Brian Somers <brian@FreeBSD.org>

Move the *intrq variables into net/intrq.c and unconditionally
include this in all kernels. Declare some const *intrq_present
variables that can be checked by a module prior to using *intrq
to queue data.

Make the if_tun module capable of processing atm, ip, ip6, ipx,
natm and netatalk packets when TUNSIFHEAD is ioctl()d on.

Review not required by: freebsd-hackers


# 23ecf4b5 12-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

removed an ours case which think a packet destined to loopback interface
with IPv6 loopback addr for its dest or src addr as ours.


# 6a800098 22-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# cfa1ca9d 07-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translater daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# e1da8747 22-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

Removed IPSEC and IPV6FIREWALL because they are not ready yet.


# 82cd038d 21-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project