History log of /freebsd-9.3-release/sys/netinet/ip_output.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 267654 19-Jun-2014 gjb

Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 266192 15-May-2014 tuexen

MFC r265691, r265713:

For some UDP packets (for example with 200 byte payload) and IP options,
the IP header and the UDP header are not in the same mbuf.
Add code to in_delayed_cksum() to deal with this case.

Use KASSERTs as suggested by glebius@


# 253281 12-Jul-2013 trociny

MFC r252710:

In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option. It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.

PR: 179901
Submitted by: Michael Gmelin freebsd grem.de (initial version)
Reviewed by: rwatson
Approved by: re (kib)


# 252828 05-Jul-2013 andre

MFC r243624:

Remove unused and unnecessary CSUM_IP_FRAGS checksumming capability.
Checksumming the IP header of fragments is no different from doing
normal IP headers.

Discussed with: yongari

Note that the definition of CSUM_IP_FRAGS is left in mbuf.h to allow
for stable branch source compatibility.


# 249892 25-Apr-2013 rrs

MFC of r249848

PR: 174749, 157796


# 243586 27-Nov-2012 ae

MFC r242079:
Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

MFC r242082:
Note the removal of the IPFIREWALL_FORWARD kernel option.

MFC r242463:
Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.


# 240305 10-Sep-2012 glebius

Merge r238002, r238016, r238092 from head:

r238002 by tuexen@:
Remove dead code (on FreeBSD) as suggested by glebius@.

r238016:
Remove route caching from IP multicast routing code. There is no
reason to do that, and also, cached route never got unreferenced,
which meant a reference leak.

Reviewed by: bms

r238092:
When ip_output()/ip6_output() is supplied a struct route *ro argument,
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.

The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.

- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.


# 238713 23-Jul-2012 glebius

Merge from head r238572, r238573:
------------------------------------------------------------------------
r238572 | glebius | 2012-07-18 12:41:00 +0400 (ср, 18 июл 2012) | 3 lines

When traversing global in_ifaddr list in the IFP_TO_IA() macro, we need
to obtain IN_IFADDR_RLOCK().

------------------------------------------------------------------------
r238573 | glebius | 2012-07-18 12:58:30 +0400 (ср, 18 июл 2012) | 5 lines

Plug a reference leak: before doing 'goto again' we need to unref
ia->ia_ifa if there is any.

Submitted by: Andrey Zonov <andrey zonov.org>

Approved by: re (kib)


# 237910 01-Jul-2012 tuexen

MFC r236959:
Add a IP_RECVTOS socket option to receive for received UDP/IPv4
packets a cmsg of type IP_RECVTOS which contains the TOS byte.
Much like IP_RECVTTL does for TTL. This allows to implement a
protocol on top of UDP and implementing ECN.


# 227428 10-Nov-2011 trociny

MFC r227204, 227206, 227207:

r227204:

Fix the typo made in r157474.

r227206:

Before dereferencing intotw() check for NULL, the same way as it is
done for in_pcb (see r157474).

r227207:

Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.

This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.

Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by: rwatson
Glanced by: rwatson
Tested by: dave jones <s.dave.jones@gmail.com>

Approved by: re (kib)


# 225736 22-Sep-2011 kensmith

Copy head to stable/9 as part of 9.0-RELEASE release cycle.

Approved by: re (implicit)


# 220619 14-Apr-2011 bz

The mbuf_frag_size always was and is file local and not queried from base
user space tools via kvm. Mark it static.

MFC after: 3 days


# 216857 31-Dec-2010 bz

Try to catch a possible divide-by-zero as early as possible if "mtu" is 0
(also test for negative MTUs if checking it anyway).
An MTU of 0 is arguably a bug elsewhere, but this at least gives us some
more debugging hints.

Sponsored by: ISPsystem (Early 2010)
MFC after: 1 week


# 213101 24-Sep-2010 attilio

IP_BINDANY is not correctly handled in getsockopt() case.
Fix it by specifying the correct bits.

Sponsored by: Sandvine Incorporated
Reviewed by: bz, emaste, rstone
Obtained from: Sandvine Incorporated
MFC after: 10 days


# 208553 25-May-2010 qingli

This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

MFC after: 3 days


# 205104 12-Mar-2010 rrs

The proper fix for the delayed SCTP checksum is to
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
PR: 144529
MFC after: 2 weeks


# 205066 12-Mar-2010 kmacy

- restructure flowtable to support ipv6
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
(e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate

MFC after: 7 days


# 204902 08-Mar-2010 qingli

One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.

The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.

Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.

MFC after: 5 days


# 201141 28-Dec-2009 bz

Make the compiler happy after r201125:
- + remove two unnecessary initializations in ip_output;
+ + remove one unnecessary initializations in ip_output;


# 201131 28-Dec-2009 luigi

introduce a local variable rte acting as a cache of ro->ro_rt
within ip_output, achieving (in random order of importance):
- a reduction of the number of 'r's in the source code;
- improved legibility;
- a reduction of 64 bytes in the .text


# 201125 28-Dec-2009 luigi

+ remove an unused #define print_ip;
+ remove two unnecessary initializations in ip_output;
+ localize 'len';
+ introduce a temporary variable n to count the number of fragments,
the compiler seems unable to identify a common subexpression
(written 3 times, used twice);
+ document some assumptions on ip_len and ip_hl


# 199102 09-Nov-2009 trasz

Remove ifdefed out part of code, which seems to have originated a decade ago
in OpenBSD. As it is now, there is no way for this to be useful, since IPsec
is free to forward packets via whatever interface it wants, so checking
capabilities of the interface passed from ip_output (fetched from the routing
table) serves no purpose.

Discussed with: sam@


# 197952 11-Oct-2009 julian

Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.

Sitting aroung in tree waiting to commit for: 2 months
MFC after: 2 months


# 196608 28-Aug-2009 qingli

Do not try to free the rt_lle entry of the cached route in
ip_output() if the cached route was not initialized from the
flow-table. The rt_lle entry is invalid unless it has been
initialized through the flow-table.

Reviewed by: kmacy, rwatson
MFC after: immediately


# 196368 18-Aug-2009 kmacy

- change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
- check that a cached flow corresponds to the same fib index as the
packet for which we are doing the lookup
- at interface detach time flush any flows referencing stale rtentrys
associated with the interface that is going away (fixes reported
panics)
- reduce the time between cleans in case the cleaner is running at
the time the eventhandler is called and the wakeup is missed less
time will elapse before the eventhandler returns
- separate per-vnet initialization from global initialization
(pointed out by jeli@)

Reviewed by: sam@
Approved by: re@


# 196234 14-Aug-2009 qingli

In function ip_output(), the cached route is flushed when there is a
mismatch between the cached entry and the intended destination. The
cached rtentry{} is flushed but the associated llentry{} is not. This
causes the wrong destination MAC address being used in the output
packets. The fix is to flush the llentry{} when rtentry{} is cleared.

Reviewed by: kmacy, rwatson
Approved by: re


# 196019 01-Aug-2009 rwatson

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 195699 14-Jul-2009 rwatson

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 194760 23-Jun-2009 rwatson

Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:

ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)


# 194660 22-Jun-2009 zec

V_irtualize flowtable state.

This change should make options VIMAGE kernel builds usable again,
to some extent at least.

Note that the size of struct vnet_inet has changed, though in
accordance with one-bump-per-day policy we didn't update the
__FreeBSD_version number, given that it has already been touched
by r194640 a few hours ago.
Reviewed by: bz
Approved by: julian (mentor)


# 194076 12-Jun-2009 bz

Move the kernel option FLOWTABLE chacking from the header file to the
actual implementation.
Remove the accessor functions for the compiled out case, just returning
"unavail" values. Remove the kernel conditional from the header file as
it is no longer needed, only leaving the externs.
Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size
under the kernel option as well.

Reviewed by: rwatson


# 193550 05-Jun-2009 pjd

Only four out of nine arguments for ip_ipsec_output() are actually used.
Kill unused arguments except for 'ifp' as it might be used in the future
for detecting IPsec-capable interfaces.


# 193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 193217 01-Jun-2009 pjd

- Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
as there is no more space for more SOL_SOCKET options, but this option also
fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
functionality (currently only unjail root can use it).

Discussed with: julian, adrian, jhb, rwatson, kmacy


# 192528 21-May-2009 rwatson

Consolidate and clean up the first section of ip_output.c in light of the
last year or two's work on routing:

- Combine iproute initialization and flowtable lookup blocks, eliminating
unnecessary tests for known-zero'd iproute fields.

- Add a comment indicating (a) why the route entry returned by the
flowtable is considered stable and (b) that the flowtable lookup must
occur after the setup of the mbuf flow ID.

- Assert the inpcb lock before any use of inpcb fields.

Reviewed by: kmacy


# 191621 28-Apr-2009 trasz

Don't require packet to match a route (any route; this information wasn't
used anyway, so a typical workaround was to add a dummy route) if it's going
to be sent through IPSec tunnel.

Reviewed by: bz


# 191259 19-Apr-2009 kmacy

- Allocate a small flowtable in ip_input.c (changeable by tuneable)
- Use for accelerating ip_output


# 191148 16-Apr-2009 kmacy

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by: rwatson


# 190951 11-Apr-2009 rwatson

Update stats in struct ipstat using four new macros, IPSTAT_ADD(),
IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly
manipulating the fields across the kernel. This will make it easier
to change the implementation of these statistics, such as using
per-CPU versions of the data structures.

MFC after: 3 days


# 190880 10-Apr-2009 kmacy

Import "flowid" support for serializing flows across transmit queues

Reviewed by: rwatson and jeli


# 189359 04-Mar-2009 bms

In ip_output(), do not acquire the IN_MULTI_LOCK(),
and do not attempt to perform a group lookup.
This is a socket layer lock, and the bottom half of IP
really has no business taking it.

Use the value of the in_mcast_loop sysctl to determine
if we should loop back by default, in the absence of
any multicast socket options. Because the check on
group membership is now deferred to the input path,
an m_copym() is now required.

This should increase multicast send performance where the
source has not requested loopback, although this has not been
benchmarked or measured.

It is also a necessary change for IN_MULTI_LOCK to become
non-recursive, which is required in order to implement IGMPv3
in a thread-safe way.


# 189106 27-Feb-2009 bz

For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.


# 188306 08-Feb-2009 bz

Try to remove/assimilate as much of formerly IPv4/6 specific
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.

The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.

Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.

Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.

Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.

Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().

Reviewed by: sam
Discussed with: rwatson (renaming to *_internal)
MFC after: 26 days
X-MFC: keep wrapper functions for public symbols?


# 188066 03-Feb-2009 rrs

Adds support for SCTP checksum offload. This means
we, like TCP and UDP, move the checksum calculation
into the IP routines when there is no hardware support
we call into the normal SCTP checksum routine.

The next round of SCTP updates will use
this functionality. Of course the IGB driver needs
a few updates to support the new intel controller set
that actually does SCTP csum offload too.

Reviewed by: gnn, rwatson, kmacy


# 186961 09-Jan-2009 adrian

Fix indentation; add FALLTHROUGH.

Thanks Max!


# 186955 09-Jan-2009 adrian

Implement a new IP option (not compiled/enabled by default) to allow
applications to specify a non-local IP address when bind()'ing a socket
to a local endpoint.

This allows applications to spoof the client IP address of connections
if (obviously!) they somehow are able to receive the traffic normally
destined to said clients.

This patch doesn't include any changes to ipfw or the bridging code to
redirect the client traffic through the PCB checks so TCP gets a shot
at it. The normal behaviour is that packets with a non-local destination
IP address are not handled locally. This can be dealth with some IPFW hackery;
modifications to IPFW to make this less hacky will occur in subsequent
commmits.

Thanks to Julian Elischer and others at Ironport. This work was approved
and donated before Cisco acquired them.

Obtained from: Julian Elischer and others
MFC after: 2 weeks


# 186717 03-Jan-2009 rwatson

Allow the IP_MINTTL socket option to be set to 0 so that it can be
disabled entirely, which is its default state before set to a
non-zero value.

PR: 128790
Submitted by: Nick Hilliard <nick at foobar dot org>
MFC after: 3 weeks


# 186119 15-Dec-2008 qingli

This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion


# 185895 10-Dec-2008 zec

Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 185571 02-Dec-2008 bz

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 185348 26-Nov-2008 zec

Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 185101 19-Nov-2008 julian

Fix a scope problem in the multiple routing table code that stopped the
SO_SETFIB socket option from working correctly.

Obtained from: Ironport
MFC after: 3 days


# 185088 19-Nov-2008 zec

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 183550 02-Oct-2008 zec

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 182463 29-Aug-2008 gnn

Fix a bug whereby multicast packets that are looped back locally
wind up with the incorrect checksum on the wire when transmitted via
devices that do checksum offloading.

PR: kern/119635
Reviewed by: rwatson
MFC after: 5 days


# 181966 21-Aug-2008 rwatson

Remove comments and #ifdef notyet'd code relating to directly dispatching
the IP multicast input code from the output path; we don't allow
reentrance of the input path from the IP output path, it must use the
netisr due to potential lock recursion.

MFC after: 3 days


# 181803 17-Aug-2008 bz

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 178888 09-May-2008 julian

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 178319 19-Apr-2008 rwatson

In ip_output(), allow a read lock as well as a write lock when asserting
a lock on the passed inpcb.

MFC after: 3 months


# 178285 17-Apr-2008 rwatson

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# 178167 13-Apr-2008 qingli

This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,

route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2

The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"

Multiple default routes can also be inserted. Here is the netstat
output:

default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0

When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,

route delete default

would fail and trigger the following error message:

"route: writing to routing socket: No such process"
"delete net default: not in table"

On the other hand,

route delete default 10.2.5.2

would be successful: "delete net default: gateway 10.2.5.2"

One does not have to specify a gateway if there is only a single
route for a particular destination.

I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.

Reviewed by: robert, sam, gnn, julian, kmacy


# 177599 25-Mar-2008 ru

Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by: arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.


# 175892 02-Feb-2008 bz

Rather than passing around a cached 'priv', pass in an ucred to
ipsec*_set_policy and do the privilege check only if needed.

Try to assimilate both ip*_ctloutput code blocks calling ipsec*_set_policy.

Reviewed by: rwatson


# 172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 172467 07-Oct-2007 silby

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# 171167 03-Jul-2007 gnn

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


# 171133 01-Jul-2007 gnn

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


# 170613 12-Jun-2007 bms

Import rewrite of IPv4 socket multicast layer to support source-specific
and protocol-independent host mode multicast. The code is written to
accomodate IPv6, IGMPv3 and MLDv2 with only a little additional work.

This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.

The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html

Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazy allocated as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.

This work was financially supported by another FreeBSD committer.

Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)


# 169454 10-May-2007 rwatson

Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.


# 167831 23-Mar-2007 bms

Purge two redundant case labels.


# 167141 01-Mar-2007 bms

Fix undirected broadcast sends for the case where SO_DONTROUTE has also
been set at the socket layer, in our somewhat convoluted IPv4 source
selection logic in ip_output().

IP_ONESBCAST is actually a special case of SO_DONTROUTE, as 255.255.255.255
must always be delivered on a local link with a TTL of 1.

If IP_ONESBCAST has been set at the socket layer, also perform destination
interface lookup for point-to-point interfaces based on the destination
address of the link; previously it was not possible to use the option with
such interfaces; also, the destination/broadcast address fields map to the
same field within struct ifnet, which doesn't help matters.

One more valid fix going forward for these issues is to treat 255.255.255.255
as a destination in its own right in the forwarding trie. Other
implementations do this. It fits with the use of multiple paths, though
it then becomes necessary to specify interface preference.
This hack will eventually go away when that comes to pass.

Reviewed by: andre
MFC after: 1 week


# 165082 10-Dec-2006 bms

Back out revision 1.264.

Fixing the IP accounting issue, if we plan to do so, needs to be better
thought out; the 'fix' introduces a hash lookup and a possible kernel panic.

Reported by: Mark Tinguely


# 164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 162798 29-Sep-2006 andre

Remove stone-aged and irrelevant "#ifndef notdef".


# 162616 25-Sep-2006 bms

Forced commit to note this change should be MFCed.

MFC after: 1 week


# 162615 25-Sep-2006 bms

Account for output IP datagrams on the ifaddr where they originated from,
*not* the first ifaddr on the ifp. This is similar to what NetBSD does.

PR: kern/72936
Submitted by: alfred
Reviewed by: andre


# 162231 11-Sep-2006 andre

Fix a NULL pointer dereference of ro->ro_rt->rt_flags by checking for the
validity of ro->ro_rt first. This prevents crashing on any non-normally
routed IP packet.

Coverity CID: 162 (incorrectly, it was re-introduced by previous commit)


# 162205 10-Sep-2006 jmg

make use of the host route's mtu for processing. This means we can now
support a network w/ split mtu's by assigning each host route the correct
mtu. an aspiring programmer could write a daemon to probe hosts and find
out if they support a larger mtu.


# 162084 06-Sep-2006 andre

First step of TSO (TCP segmentation offload) support in our network stack.

o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005


# 162068 06-Sep-2006 andre

Fix the socket option IP_ONESBCAST by giving it its own case in ip_output()
and skip over the normal IP processing.

Add a supporting function ifa_ifwithbroadaddr() to verify and validate the
supplied subnet broadcast address.

PR: kern/99558
Tested by: Andrey V. Elsukov <bu7cher-at-yandex.ru>
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 161380 16-Aug-2006 julian

Remove the IPFIREWALL_FORWARD_EXTENDED option and make it on by default as it always was
in older versions of FreeBSD. This option is pointless as it is needed in just
about every interesting usage of forward that I have ever seen. It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x or 7.x
Reviewed by: glebius
MFC after: 1 week


# 160027 29-Jun-2006 glebius

Fix URL to Bellovin's paper.

Submitted by: Anton Yuzhaninov <citrin rambler-co.ru>


# 158799 21-May-2006 maxim

o Add missed error check: in ip_ctloutput() sooptcopyin() returns a
result but we never examine it.

Reviewed by: rwatson
MFC after: 2 weeks


# 158563 14-May-2006 bms

Fix a long-standing limitation in IPv4 multicast group membership.

By making the imo_membership array a dynamically allocated vector,
this minimizes disruption to existing IPv4 multicast code. This
change breaks the ABI for the kernel module ip_mroute.ko, and may
cause a small amount of churn for folks working on the IGMPv3 merge.

Previously, sockets were subject to a compile-time limitation on
the number of IPv4 group memberships, which was hard-coded to 20.
The imo_membership relationship, however, is 1:1 with regards to
a tuple of multicast group address and interface address. Users who
ran routing protocols such as OSPF ran into this limitation on machines
with a large system interface tree.


# 155201 02-Feb-2006 csjp

Somewhat re-factor the read/write locking mechanism associated with the packet
filtering mechanisms to use the new rwlock(9) locking API:

- Drop the variables stored in the phil_head structure which were specific to
conditions and the home rolled read/write locking mechanism.
- Drop some includes which were used for condition variables
- Drop the inline functions, and convert them to macros. Also, move these
macros into pfil.h
- Move pfil list locking macros intp phil.h as well
- Rename ph_busy_count to ph_nhooks. This variable will represent the number
of IN/OUT hooks registered with the pfil head structure
- Define PFIL_HOOKED macro which evaluates to true if there are any
hooks to be ran by pfil_run_hooks
- In the IP/IP6 stacks, change the ph_busy_count comparison to use the new
PFIL_HOOKED macro.
- Drop optimization in pfil_run_hooks which checks to see if there are any
hooks to be ran, and returns if not. This check is already performed by the
IP stacks when they call:

if (!PFIL_HOOKED(ph))
goto skip_hooks;

- Drop in assertion which makes sure that the number of hooks never drops
below 0 for good measure. This in theory should never happen, and if it
does than there are problems somewhere
- Drop special logic around PFIL_WAITOK because rw_wlock(9) does not sleep
- Drop variables which support home rolled read/write locking mechanism from
the IPFW firewall chain structure.
- Swap out the read/write firewall chain lock internal to use the rwlock(9)
API instead of our home rolled version
- Convert the inlined functions to macros

Reviewed by: mlaier, andre, glebius
Thanks to: jhb for the new locking API


# 155179 01-Feb-2006 andre

Move the IPSEC related code blocks to their own file to unclutter
and signifincantly improve the readability of ip_input() and
ip_output() again.

The resulting IPSEC hooks in ip_input() and ip_output() may be
used later on for making IPSEC loadable.

This move is mostly mechanical and should preserve current IPSEC
behaviour as-is. Nothing shall prevent improvements in the way
IPSEC interacts with the IPv4 stack.

Discussed with: bz, gnn, rwatson; (earlier version)


# 154526 18-Jan-2006 andre

In in_delayed_cksum() we can't perform a m_pullup() as it may
change the mbuf pointer and we don't have any way of passing
it back to the callers. Instead just fail silently without
updating the checksum but leaving the mbuf+chain intact.

A search in our GNATS database did not turn up any match for
the existing warning message when this case is encountered.

Found by: Coverity Prevent(tm)
Coverity ID: CID779
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 154520 18-Jan-2006 andre

Prevent dereferencing a NULL route pointer when trying to update the
route MTU.

This bug is very difficult to reach and not remotely exploitable.

Found by: Coverity Prevent(tm)
Coverity ID: CID162
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 153164 06-Dec-2005 glebius

When we drop packet due to no space in output interface output queue, also
increase the ifp->if_snd.ifq_drops.

PR: 72440
Submitted by: ikob


# 152592 18-Nov-2005 andre

Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
ip_dooptions(struct mbuf *m, int pass)
save_rte(m, option, dst)
ip_srcroute(m0)
ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
ip_insertoptions(m, opt, phlen)
ip_optcopy(ip, jp)
ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


# 152583 18-Nov-2005 andre

Purge layer specific mbuf flags on layer crossings to avoid confusing
upper or lower layers.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 151967 02-Nov-2005 andre

Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 150594 26-Sep-2005 andre

Implement IP_DONTFRAG IP socket option enabling the Don't Fragment
flag on IP packets. Currently this option is only repected on udp
and raw ip sockets. On tcp sockets the DF flag is controlled by the
path MTU discovery option.

Sending a packet larger than the MTU size of the egress interface
returns an EMSGSIZE error.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


# 149635 30-Aug-2005 andre

Use the correct mbuf type for MGET().


# 149371 22-Aug-2005 andre

Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket. All packets with a
lower TTL are silently dropped. Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682. Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005


# 148903 09-Aug-2005 rwatson

Add helper function ip_findmoptions(), which accepts an inpcb, and attempts
to atomically return either an existing set of IP multicast options for the
PCB, or a newlly allocated set with default values. The inpcb is returned
locked. This function may sleep.

Call ip_moptions() to acquire a reference to a PCB's socket options, and
perform the update of the options while holding the PCB lock. Release the
lock before returning.

Remove garbage collection of multicast options when values return to the
default, as this complicates locking substantially. Most applications
allocate a socket either to be multicast, or not, and don't tend to keep
around sockets that have previously been used for multicast, then used for
unicast.

This closes a number of race conditions involving multiple threads or
processes modifying the IP multicast state of a socket simultaenously.

MFC after: 7 days


# 148682 03-Aug-2005 rwatson

Introduce in_multi_mtx, which will protect IPv4-layer multicast address
lists, as well as accessor macros. For now, this is a recursive mutex
due code sequences where IPv4 multicast calls into IGMP calls into
ip_output(), which then tests for a multicast forwarding case.

For support macros in in_var.h to check multicast address lists, assert
that in_multi_mtx is held.

Acquire in_multi_mtx around iteration over the IPv4 multicast address
lists, such as in ip_input() and ip_output().

Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses,
as well as over the manipulation of ifnet multicast address lists in order
to keep the two layers in sync.

Lock down accesses to IPv4 multicast addresses in IGMP, or assert the
lock when performing IGMP join/leave events.

Eliminate spl's associated with IPv4 multicast addresses, portions of
IGMP that weren't previously expunged by IGMP locking.

Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded
lock order in WITNESS, in that order.

Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after: 10 days


# 147785 05-Jul-2005 rwatson

Eliminate MAC entry point mac_create_mbuf_from_mbuf(), which is
redundant with respect to existing mbuf copy label routines. Expose
a new mac_copy_mbuf() routine at the top end of the Framework and
use that; use the existing mpo_copy_mbuf_label() routine on the
bottom end.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA, SPAWAR
Approved by: re (scottl)


# 147256 10-Jun-2005 brooks

Stop embedding struct ifnet at the top of driver softcs. Instead the
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.

This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.

Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.

Reviewed by: sobomax, sam


# 142248 22-Feb-2005 andre

Bring back the full packet destination manipulation for 'ipfw fwd'
with the kernel compile time option:

options IPFIREWALL_FORWARD_EXTENDED

This option has to be specified in addition to IPFIRWALL_FORWARD.

With this option even packets targeted for an IP address local
to the host can be redirected. All restrictions to ensure proper
behaviour for locally generated packets are turned off. Firewall
rules have to be carefully crafted to make sure that things like
PMTU discovery do not break.

Document the two kernel options.

PR: kern/71910
PR: kern/73129
MFC after: 1 week


# 140675 23-Jan-2005 alc

Correctly move the packet header in ip_insertoptions().

Reported by: Anupam Chanda
Reviewed by: sam@
MFC after: 2 weeks


# 139823 06-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 139310 25-Dec-2004 rwatson

Remove an errant blank line apparently introduced in
ip_output.c:1.194.


# 138408 05-Dec-2004 rwatson

Pass the inpcb reference into ip_getmoptions() rather than just the
inp->inp_moptions pointer, so that ip_getmoptions() can perform
necessary locking when doing non-atomic reads.

Lock the inpcb by default to copy any data to local variables, then
unlock before performing sooptcopyout().

MFC after: 2 weeks


# 138404 05-Dec-2004 rwatson

Push the inpcb argument into ip_setmoptions() when setting IP multicast
socket options, so that it is available for locking.


# 138398 05-Dec-2004 rwatson

Forced commit to note that the previous revision, 1.233, should be:

MFC after: 2 weeks


# 138397 05-Dec-2004 rwatson

Start working through inpcb locking for ip_ctloutput() by cleaning up
modifications to the inpcb IP options mbuf:

- Lock the inpcb before passing it into ip_pcbopts() in order to prevent
simulatenous reads and read-modify-writes that could result in races.
- Pass the inpcb reference into ip_pcbopts() instead of the option chain
pointer in the inpcb.
- Assert the inpcb lock in ip_pcbots.
- Convert one or two uses of a pointer as a boolean or an integer
comparison to a comparison with NULL for readability.


# 135920 29-Sep-2004 mlaier

Add an additional struct inpcb * argument to pfil(9) in order to enable
passing along socket information. This is required to work around a LOR with
the socket code which results in an easy reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a seperate (later) commit.

This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
forseeable future.

Suggested by: rwatson
A lot of work by: csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by: rwatson, csjp
Tested by: -pf, -ipfw, LINT, csjp and myself
MFC after: 3 days

LOR IDs: 14 - 17 (not fixed yet)


# 135160 13-Sep-2004 andre

Make comments more clear for the packet changed cases after pfil hooks.


# 134852 06-Sep-2004 jmg

revert comment from rev1.158 now that rev1.225 backed it out..

MFC after: 3 days


# 134385 27-Aug-2004 andre

In the case the destination of a packet was changed by the packet filter
to point to a local IP address; and the packet was sourced from this host
we fill in the m_pkthdr.rcvif with a pointer to the loopback interface.

Before the function ifunit("lo0") was used to obtain the ifp. However
this is sub-optimal from a performance point of view and might be dangerous
if the loopback interface has been renamed. Use the global variable 'loif'
instead which always points to the loopback interface.

Submitted by: brooks


# 134383 27-Aug-2004 andre

Always compile PFIL_HOOKS into the kernel and remove the associated kernel
compile option. All FreeBSD packet filters now use the PFIL_HOOKS API and
thus it becomes a standard part of the network stack.

If no hooks are connected the entire packet filter hooks section and related
activities are jumped over. This removes any performance impact if no hooks
are active.

Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.


# 134172 22-Aug-2004 mlaier

Allow early drop for non-ALTQ enabled queues in an ALTQ-enabled kernel.
Previously the early drop was disabled unconditionally for ALTQ-enabled
kernels.

This should give some benefit for the normal gateway + LAN-server case with
a busy LAN leg and an ALTQ managed uplink.

Reviewed and style help from: cperciva, pjd


# 133922 17-Aug-2004 peter

Make the kernel compile again if you are not using PFIL_HOOKS


# 133920 17-Aug-2004 andre

Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland
and preserves the ipfw ABI. The ipfw core packet inspection and filtering
functions have not been changed, only how ipfw is invoked is different.

However there are many changes how ipfw is and its add-on's are handled:

In general ipfw is now called through the PFIL_HOOKS and most associated
magic, that was in ip_input() or ip_output() previously, is now done in
ipfw_check_[in|out]() in the ipfw PFIL handler.

IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to
be diverted is checked if it is fragmented, if yes, ip_reass() gets in for
reassembly. If not, or all fragments arrived and the packet is complete,
divert_packet is called directly. For 'tee' no reassembly attempt is made
and a copy of the packet is sent to the divert socket unmodified. The
original packet continues its way through ip_input/output().

ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet
with the new destination sockaddr_in. A check if the new destination is a
local IP address is made and the m_flags are set appropriately. ip_input()
and ip_output() have some more work to do here. For ip_input() the m_flags
are checked and a packet for us is directly sent to the 'ours' section for
further processing. Destination changes on the input path are only tagged
and the 'srcrt' flag to ip_forward() is set to disable destination checks
and ICMP replies at this stage. The tag is going to be handled on output.
ip_output() again checks for m_flags and the 'ours' tag. If found, the
packet will be dropped back to the IP netisr where it is going to be picked
up by ip_input() again and the directly sent to the 'ours' section. When
only the destination changes, the route's 'dst' is overwritten with the
new destination from the forward m_tag. Then it jumps back at the route
lookup again and skips the firewall check because it has been marked with
M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with
'option IPFIREWALL_FORWARD' to enable it.

DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for
a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will
then inject it back into ip_input/ip_output() after it has served its time.
Dummynet packets are tagged and will continue from the next rule when they
hit the ipfw PFIL handlers again after re-injection.

BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as
they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS.

More detailed changes to the code:

conf/files
Add netinet/ip_fw_pfil.c.

conf/options
Add IPFIREWALL_FORWARD option.

modules/ipfw/Makefile
Add ip_fw_pfil.c.

net/bridge.c
Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw
is still directly invoked to handle layer2 headers and packets would
get a double ipfw when run through PFIL_HOOKS as well.

netinet/ip_divert.c
Removed divert_clone() function. It is no longer used.

netinet/ip_dummynet.[ch]
Neither the route 'ro' nor the destination 'dst' need to be stored
while in dummynet transit. Structure members and associated macros
are removed.

netinet/ip_fastfwd.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.

netinet/ip_fw.h
Removed 'ro' and 'dst' from struct ip_fw_args.

netinet/ip_fw2.c
(Re)moved some global variables and the module handling.

netinet/ip_fw_pfil.c
New file containing the ipfw PFIL handlers and module initialization.

netinet/ip_input.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code. ip_forward() does not longer require
the 'next_hop' struct sockaddr_in argument. Disable early checks
if 'srcrt' is set.

netinet/ip_output.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.

netinet/ip_var.h
Add ip_reass() as general function. (Used from ipfw PFIL handlers
for IPDIVERT.)

netinet/raw_ip.c
Directly check if ipfw and dummynet control pointers are active.

netinet/tcp_input.c
Rework the 'ipfw forward' to local code to work with the new way of
forward tags.

netinet/tcp_sack.c
Remove include 'opt_ipfw.h' which is not needed here.

sys/mbuf.h
Remove m_claim_next() macro which was exclusively for ipfw 'forward'
and is no longer needed.

Approved by: re (scottl)


# 133720 14-Aug-2004 dwmalone

Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD
have already done this, so I have styled the patch on their work:

1) introduce a ip_newid() static inline function that checks
the sysctl and then decides if it should return a sequential
or random IP ID.

2) named the sysctl net.inet.ip.random_id

3) IPv6 flow IDs and fragment IDs are now always random.
Flow IDs and frag IDs are significantly less common in the
IPv6 world (ie. rarely generated per-packet), so there should
be smaller performance concerns.

The sysctl defaults to 0 (sequential IP IDs).

Reviewed by: andre, silby, mlaier, ume
Based on: NetBSD
MFC after: 2 months


# 133481 11-Aug-2004 andre

Consistently use NULL for pointer comparisons.


# 133389 09-Aug-2004 andre

Make a comment that "ipfw forward" is not SMP and PREEMPTION safe.


# 133074 03-Aug-2004 andre

o Delayed checksums are now calculated in divert_packet() for diverted packets
Remove the XXX-escaped code that did it in ip_output()'s IPHACK section.


# 131012 24-Jun-2004 rwatson

In ip_ctloutput(), acquire the inpcb lock around some of the basic
inpcb flag and status updates.


# 130416 13-Jun-2004 mlaier

Link ALTQ to the build and break with ABI for struct ifnet. Please recompile
your (network) modules as well as any userland that might make sense of
sizeof(struct ifnet).
This does not change the queueing yet. These changes will follow in a
seperate commit. Same with the driver changes, which need case by case
evaluation.

__FreeBSD_version bump will follow.

Tested-by: (i386)LINT


# 129126 11-May-2004 maxim

o Calculate a number of bytes to copy (cnt) correctly:

+----+-+-+-+-+----+----+- - - - - - - - - - - - -+----+
| | |C| | | | | | |
| IP |N|O|L|P| | IP | | IP |
| #1 |O|D|E|T| | #2 | | #n |
| |P|E|N|R| | | | |
+----+-+-+-+-+----+----+- - - - - - - - - - - - -+----+
^ ^<---- cnt - (IPOPT_MINOFF - 1) ---->|
| |
src | +-- cp[IPOPT_OFF + 1] + sizeof(struct in_addr)
|
dst +-- cp[IPOPT_OFF + 1]

PR: kern/66386
Submitted by: Andrei Iltchenko
MFC after: 3 weeks


# 128829 02-May-2004 darrenr

Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier.


# 128816 02-May-2004 darrenr

Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h alongside
all of the other macros that work ok mbuf's and tag's.


# 128210 13-Apr-2004 luigi

In an effort to simplify the routing code, try to deprecate rtalloc()
in favour of rtalloc_ign(), which is what would end up being called
anyways.

There are 25 more instances of rtalloc() in net*/ and
about 10 instances of rtalloc_ign()


# 128019 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 128003 07-Apr-2004 ru

Fixed a bug in previous revision: compute the payload checksum before
we convert ip_len into a network byte order; in_delayed_cksum() still
expects it in host byte order.

The symtom was the ``in_cksum_skip: out of data by %d'' complaints
from the kernel.

To add to the previous commit log. These fixes make tcpdump(1) happy
by not complaining about UDP/TCP checksum being bad for looped back
IP multicast when multicast router is deactivated.

Reported by: Vsevolod Lobko


# 127396 25-Mar-2004 ru

Untangle IP multicast routing interaction with delayed payload checksums.

Compute the payload checksum for a locally originated IP multicast where
God intended, in ip_mloopback(), rather than doing it in ip_output() and
only when multicast router is active. This is more correct as we do not
fool ip_input() that the packet has the correct payload checksum when in
fact it does not (when multicast router is inactive). This is also more
efficient if we don't join the multicast group we send to, thus allowing
the hardware to checksum the payload.


# 126486 02-Mar-2004 mlaier

Two minor follow-ups on the MT_TAG removal:
ifp is now passed explicitly to ether_demux; no need to look it up again.
Make mtag a global var in ip_input.

Noticed by: rwatson
Approved by: bms(mentor)


# 126239 25-Feb-2004 mlaier

Re-remove MT_TAGs. The problems with dummynet have been fixed now.

Tested by: -current, bms(mentor), me
Approved by: bms(mentor), sam


# 125952 17-Feb-2004 mlaier

Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is
not working properly with the patch in place.

Approved by: bms(mentor)


# 125875 16-Feb-2004 ume

don't update outgoing ifp, if ipsec tunnel mode encapsulation
was not made.

Obtained from: KAME


# 125784 13-Feb-2004 mlaier

This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).

This is (mostly) work from: sam

Silence from: -arch
Approved by: bms(mentor), sam, rwatson


# 125680 11-Feb-2004 bms

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


# 125396 03-Feb-2004 ume

pass pcb rather than so. it is expected that per socket policy
works again.


# 124247 08-Jan-2004 andre

Do not set the ip_id to zero when DF is set on packet and
restore the general pre-randomid behaviour.

Setting the ip_id to zero causes several problems with
packet reassembly when a device along the path removes
the DF bit for some reason.

Other BSD and Linux have found and fixed the same issues.

PR: kern/60889
Tested by: Richard Wendland <richard@wendland.org.uk>
Approved by: re (scottl)


# 122922 20-Nov-2003 andre

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 122921 20-Nov-2003 andre

Remove RTF_PRCLONING from routing table and adjust users of it
accordingly. The define is left intact for ABI compatibility
with userland.

This is a pre-step for the introduction of tcp_hostcache. The
network stack remains fully useable with this change.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 122708 14-Nov-2003 andre

Remove the global one-level rtcache variable and associated
complex locking and rework ip_rtaddr() to do its own rtlookup.
Adopt all its callers to this and make ip_output() callable
with NULL rt pointer.

Reviewed by: sam (mentor)


# 122702 14-Nov-2003 andre

Introduce ip_fastforward and remove ip_flow.

Short description of ip_fastforward:

o adds full direct process-to-completion IPv4 forwarding code
o handles ip fragmentation incl. hw support (ip_flow did not)
o sends icmp needfrag to source if DF is set (ip_flow did not)
o supports ipfw and ipfilter (ip_flow did not)
o supports divert, ipfw fwd and ipfilter nat (ip_flow did not)
o returns anything it can't handle back to normal ip_input

Enable with sysctl -w net.inet.ip.fastforwarding=1

Reviewed by: sam (mentor)


# 122588 12-Nov-2003 andre

Do not fragment a packet with hardware assistance if it has the DF
bit set.

Reviewed by: sam (mentor)


# 122330 08-Nov-2003 sam

assert optional inpcb is passed in locked

Supported by: FreeBSD Foundation


# 122062 04-Nov-2003 ume

- cleanup SP refcnt issue.
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
kernel forwards the packets.

Tested by: nork
Obtained from: KAME


# 121972 03-Nov-2003 rwatson

Note that when ip_output() is called from ip_forward(), it will already
have its options inserted, so the opt argument to ip_output() must be
NULL.


# 120727 04-Oct-2003 sam

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


# 120386 23-Sep-2003 sam

o update PFIL_HOOKS support to current API used by netbsd
o revamp IPv4+IPv6+bridge usage to match API changes
o remove pfil_head instances from protosw entries (no longer used)
o add locking
o bump FreeBSD version for 3rd party modules

Heavy lifting by: "Max Laier" <max@love2party.net>
Supported by: FreeBSD Foundation
Obtained from: NetBSD (bits of pfil.h and pfil.c)


# 119644 01-Sep-2003 silby

Implement MBUF_STRESS_TEST mark II.

Changes from the original implementation:

- Fragmentation is handled by the function m_fragment, which can
be called from whereever fragmentation is needed. Note that this
function is wrapped in #ifdef MBUF_STRESS_TEST to discourage non-testing
use.

- m_fragment works slightly differently from the old fragmentation
code in that it allocates a seperate mbuf cluster for each fragment.
This defeats dma_map_load_mbuf/buffer's feature of coalescing adjacent
fragments. While that is a nice feature in practice, it nerfed the
usefulness of mbuf_stress_test.

- Add two modes of random fragmentation. Chains with fragments all of
the same random length and chains with fragments that are each uniquely
random in length may now be requested.


# 119178 20-Aug-2003 bms

Add the IP_ONESBCAST option, to enable undirected IP broadcasts to be sent on
specific interfaces. This is required by aodvd, and may in future help us
in getting rid of the requirement for BPF from our import of isc-dhcp.

Suggested by: fenestro
Obtained from: BSD/OS
Reviewed by: mini, sam
Approved by: jake (mentor)


# 118622 07-Aug-2003 hsu

1. Basic PIM kernel support
Disabled by default. To enable it, the new "options PIM" must be
added to the kernel configuration file (in addition to MROUTING):

options MROUTING # Multicast routing
options PIM # Protocol Independent Multicast

2. Add support for advanced multicast API setup/configuration and
extensibility.

3. Add support for kernel-level PIM Register encapsulation.
Disabled by default. Can be enabled by the advanced multicast API.

4. Implement a mechanism for "multicast bandwidth monitoring and upcalls".

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


# 117765 19-Jul-2003 silby

Minor fix to the MBUF_STRESS_TEST code so that it keeps
pkthdr.len consistant at all times. (Some debugging
code I'm working on is tripped otherwise.)

MFC after: 3 days


# 115471 31-May-2003 wollman

Don't generate an ip_id for packets with the DF bit set; ip_id is
only meaningful for fragments. Also don't bother to byte-swap the
ip_id when we do generate it; it is only used at the receiver as a
nonce. I tried several different permutations of this code with no
measurable difference to each other or to the unmodified version, so
I've settled on the one for which gcc seems to generate the best code.
(If anyone cares to microoptimize this differently for an architecture
where it actually matters, feel free.)

Suggested by: Steve Bellovin's paper in IMW'02


# 114258 29-Apr-2003 mdodd

IP_RECVTTL socket option.

Reviewed by: Stuart Cheshire <cheshire@apple.com>


# 113384 12-Apr-2003 silby

Rename MBUF_FRAG_TEST to MBUF_STRESS_TEST as it will be extended
to include more than just frag tests.


# 113255 08-Apr-2003 des

Introduce an M_ASSERTPKTHDR() macro which performs the very common task
of asserting that an mbuf has a packet header. Use it instead of hand-
rolled versions wherever applicable.

Submitted by: Hiten Pandya <hiten@unixdaemons.com>


# 113074 04-Apr-2003 des

Replace memcpy() and ovbcopy() with bcopy(); ditch some caddr_t usage.


# 112985 02-Apr-2003 mdodd

Back out support for RFC3514.

RFC3514 poses an unacceptale risk to compliant systems.


# 112983 02-Apr-2003 mdodd

- Use the correct constant define.
- Add a missing break.


# 112973 02-Apr-2003 mdodd

Sync constant define with NetBSD.

Requested by: Tom Spindler <dogcow@babymeat.com>


# 112929 01-Apr-2003 mdodd

Implement support for RFC 3514 (The Security Flag in the IPv4 Header).
(See: ftp://ftp.rfc-editor.org/in-notes/rfc3514.txt)

This fulfills the host requirements for userland support by
way of the setsockopt() IP_EVIL_INTENT message.

There are three sysctl tunables provided to govern system behavior.

net.inet.ip.rfc3514:

Enables support for rfc3514. As this is an
Informational RFC and support is not yet widespread
this option is disabled by default.

net.inet.ip.hear_no_evil

If set the host will discard all received evil packets.

net.inet.ip.speak_no_evil

If set the host will discard all transmitted evil packets.

The IP statistics counter 'ips_evil' (available via 'netstat') provides
information on the number of 'evil' packets recieved.

For reference, the '-E' option to 'ping' has been provided to demonstrate
and test the implementation.


# 112650 25-Mar-2003 mux

Try to make the MBUF_FRAG_TEST code work better.

- Don't try to fragment the packet if it's smaller than mbuf_frag_size.
- Preserve the size of the mbuf chain which is modified by m_split().
- Check that m_split() didn't return NULL.
- Make it so we don't end up with two M_PKTHDR mbuf in the chain.
- Use m->m_pkthdr.len instead of m->m_len so that we fragment the whole
chain and not just the first mbuf.
- Fix a nearby style bug and rework the logic of the loops so that it's
more clear.

This is still not quite right, because we're clearly abusing m_split() to
do something it was not designed for, but at least it works now. We
should probably move this code into a m_fragment() function when it's
correct.


# 112591 25-Mar-2003 silby

Add the MBUF_FRAG_TEST option. When compiled in, this option
allows you to tell ip_output to fragment all outgoing packets
into mbuf fragments of size net.inet.ip.mbuf_frag_size bytes.
This is an excellent way to test if network drivers can properly
handle long mbuf chains being passed to them.

net.inet.ip.mbuf_frag_size defaults to 0 (no fragmentation)
so that you can at least boot before your network driver dies. :)


# 111186 20-Feb-2003 jlemon

Remove unused variables in the IPSEC case.

Submitted by: Lars Eggert <larse@ISI.EDU>


# 111145 19-Feb-2003 jlemon

Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 110074 30-Jan-2003 sam

FAST_IPSEC bandaid: act like KAME and ignore ENOENT error codes from
ipsec4_process_packet; they happen when a packet is dropped because
an SA acquire is initiated

Submitted by: Doug Ambrisko <ambrisko@verniernetworks.com>


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 108533 01-Jan-2003 schweikh

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# 107112 20-Nov-2002 luigi

Back out the ip_fragment() code -- it is not urgent to have it in now,
I will put it back in in a better form after 5.0 is out.

Requested by: sam, rwatson, luigi (on second thought)
Approved by: re


# 107020 17-Nov-2002 luigi

Move the ip_fragment code from ip_output() to a separate function,
so that it can be reused elsewhere (there is a number of places
where it can be useful). This also trims some 200 lines from
the body of ip_output(), which helps readability a bit.

(This change was discussed a few weeks ago on the mailing lists,
Julian agreed, silence from others. It is not a functional change,
so i expect it to be ok to commit it now but i am happy to back it
out if there are objections).

While at it, fix some function headers and replace m_copy() with
m_copypacket() where applicable.

MFC after: 1 week


# 106968 15-Nov-2002 luigi

Massive cleanup of the ip_mroute code.

No functional changes, but:

+ the mrouting module now should behave the same as the compiled-in
version (it did not before, some of the rsvp code was not loaded
properly);
+ netinet/ip_mroute.c is now truly optional;
+ removed some redundant/unused code;
+ changed many instances of '0' to NULL and INADDR_ANY as appropriate;
+ removed several static variables to make the code more SMP-friendly;
+ fixed some minor bugs in the mrouting code (mostly, incorrect return
values from functions).

This commit is also a prerequisite to the addition of support for PIM,
which i would like to put in before DP2 (it does not change any of
the existing APIs, anyways).

Note, in the process we found out that some device drivers fail to
properly handle changes in IFF_ALLMULTI, leading to interesting
behaviour when a multicast router is started. This bug is not
corrected by this commit, and will be fixed with a separate commit.

Detailed changes:
--------------------
netinet/ip_mroute.c all the above.
conf/files make ip_mroute.c optional
net/route.c fix mrt_ioctl hook
netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here
together with other rsvp code, and a couple
of indentation fixes.
netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks
netinet/ip_var.h rsvp function hooks
netinet/raw_ip.c hooks for mrouting and rsvp functions, plus
interface cleanup.
netinet/ip_mroute.h remove an unused and optional field from a struct

Most of the code is from Pavlin Radoslavov and the XORP project

Reviewed by: sam
MFC after: 1 week


# 106678 08-Nov-2002 sam

correct fast ipsec logic: compare destination ip address against the
contents of the SA, not the SP

Submitted by: "Doug Ambrisko" <ambrisko@verniernetworks.com>


# 105586 20-Oct-2002 phk

Fix two instances of variant struct definitions in sys/netinet:

Remove the never completed _IP_VHL version, it has not caught on
anywhere and it would make us incompatible with other BSD netstacks
to retain this version.

Add a CTASSERT protecting sizeof(struct ip) == 20.

Don't let the size of struct ipq depend on the IPDIVERT option.

This is a functional no-op commit.

Approved by: re


# 105199 16-Oct-2002 sam

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 105194 15-Oct-2002 sam

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# 103852 23-Sep-2002 maxim

Slightly rearrange a code in rev. 1.164:

o Move len initialization closer to place of its first usage.
o Compare len with 0 to improve readability.
o Explicitly zero out phlen in ip_insertoptions() in failure case.

Suggested by: jhb
Reviewed by: jhb
MFC after: 2 weeks


# 103478 17-Sep-2002 maxim

In rare cases when there is no room for ip options ip_insertoptions()
can fail and corrupt a header length. Initialize len and check what
ip_insertoptions() returns.

Reviewed by: archie, silence on -net
MFC after: 5 days


# 101096 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

When fragmenting an IP datagram, invoke an appropriate MAC entry
point so that MAC labels may be copied (...) to the individual
IP fragment mbufs by MAC policies.

When IP options are inserted into an IP datagram when leaving a
host, preserve the label if we need to reallocate the mbuf for
alignment or size reasons.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 99891 12-Jul-2002 luigi

Avoid dereferencing a null pointer in ro_rt.

This was always broken in HEAD (the offending statement was introduced
in rev. 1.123 for HEAD, while RELENG_4 included this fix (in rev.
1.99.2.12 for RELENG_4) and I inadvertently deleted it in 1.99.2.30.

So I am also restoring these two lines in RELENG_4 now.
We might need another few things from 1.99.2.30.


# 98904 27-Jun-2002 mux

Warning fixes for 64 bits platforms. With this last fix,
I can build a GENERIC sparc64 kernel with -Werror.

Reviewed by: luigi


# 98849 26-Jun-2002 ken

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# 98666 23-Jun-2002 luigi

fix bad indentation and whitespace resulting from cut&paste


# 98613 22-Jun-2002 luigi

Remove (almost all) global variables that were used to hold
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.

The variables removed by this change are:

ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet

Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().

On passing, remove a bug in divert handling of fragmented packet.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.

Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.

option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.

NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.

* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor stule cleanup will likely be
necessary

* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.

* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).

MFC after: 10 days


# 97074 21-May-2002 arr

- Change the newly turned INVARIANTS #ifdef blocks (they were changed from
DIAGNOSTIC yesterday) into KASSERT()'s as these help to increase code
readability.


# 97020 20-May-2002 arr

- Turn a few DIAGNOSTIC into INVARIANTS since they are really sanity
checks.


# 96245 09-May-2002 luigi

Cleanup the interface to ip_fw_chk, two of the input arguments
were totally useless and have been removed.

ip_input.c, ip_output.c:
Properly initialize the "ip" pointer in case the firewall does an
m_pullup() on the packet.

Remove some debugging code forgotten long ago.

ip_fw.[ch], bridge.c:
Prepare the grounds for matching MAC header fields in bridged packets,
so we can have 'etherfw' functionality without a lot of kernel and
userland bloat.


# 93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 92960 22-Mar-2002 ru

Prevent icmp_reflect() from calling ip_output() with a NULL route
pointer which will then result in the allocated route's reference
count never being decremented. Just flood ping the localhost and
watch refcnt of the 127.0.0.1 route with netstat(1).

Submitted by: jayanth

Back out ip_output.c,v 1.143 and ip_mroute.c,v 1.69 that allowed
ip_output() to be called with a NULL route pointer. The previous
paragraph shows why this was a bad idea in the first place.

MFC after: 0 days


# 92723 19-Mar-2002 alfred

Remove __P.


# 90868 18-Feb-2002 mike

o Move NTOHL() and associated macros into <sys/param.h>. These are
deprecated in favor of the POSIX-defined lowercase variants.
o Change all occurrences of NTOHL() and associated marcros in the
source tree to use the lowercase function variants.
o Add missing license bits to sparc64's <machine/endian.h>.
Approved by: jake
o Clean up <machine/endian.h> files.
o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>.
o Remove prototypes for non-existent bswapXX() functions.
o Include <machine/endian.h> in <arpa/inet.h> to define the
POSIX-required ntohl() family of functions.
o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>,
and <sys/param.h>.
o Prepend underscores to the ntohl() family to help deal with
complexities associated with having MD (asm and inline) versions, and
having to prevent exposure of these functions in other headers that
happen to make use of endian-specific defines.
o Create weak aliases to the canonical function name to help deal with
third-party software forgetting to include an appropriate header.
o Remove some now unneeded pollution from <sys/types.h>.
o Add missing <arpa/inet.h> includes in userland.

Tested on: alpha, i386
Reviewed by: bde, jake, tmm


# 90698 15-Feb-2002 ru

Moved the 127/8 check below so that IPF redirects have a chance of working.

MFC after: 1 day


# 89624 21-Jan-2002 ume

- Check the address family of the destination cached in a PCB.
- Clear the cached destination before getting another cached route.
Otherwise, garbage in the padding space (which might be filled in if it was
used for IPv4) could annoy rtalloc.

Obtained from: KAME


# 89614 21-Jan-2002 ru

RFC1122 requires that addresses of the form { 127, <any> } MUST NOT
appear outside a host.

PR: 30792, 33996
Obtained from: ip_input.c
MFC after: 1 week


# 88931 05-Jan-2002 fenner

Pre-calculate the checksum for multicast packets sourced on a
multicast router. This is overkill; it should be possible to
delay to hardware interfaces and only pre-calculate when forwarding
to a tunnel.


# 88593 28-Dec-2001 julian

Fix ipfw fwd so that it acts as the docs say
when forwarding an incoming packet to another machine.

Obtained from: Vicor Production tree
MFC after: 3 weeks


# 88190 19-Dec-2001 yar

Don't try to free a NULL route when doing IPFIREWALL_FORWARD.
An old route will be NULL at that point if a packet were initially
routed to an interface (using the IP_ROUTETOIF flag.)

Submitted by: Igor Timkin <ivt@gamma.ru>


# 87916 14-Dec-2001 jlemon

whitespace and style fixes recovered from -stable.


# 87167 01-Dec-2001 ru

Allow for ip_output() to be called with a NULL route pointer.
This fixes a panic I introduced yesterday in ip_icmp.c,v 1.64.


# 86047 04-Nov-2001 luigi

MFS: sync the ipfw/dummynet/bridge code with the one recently merged
into stable (mostly , but not only, formatting and comments changes).


# 85741 30-Oct-2001 wpaul

Fix a (long standing?) bug in ip_output(): if ip_insertoptions() is
called and ip_output() encounters an error and bails (i.e. host
unreachable), we will leak an mbuf. This is because the code calls
m_freem(m0) after jumping to the bad: label at the end of the function,
when it should be calling m_freem(m). (m0 is the original mbuf list
_without_ the options mbuf prepended.)

Obtained from: NetBSD


# 85732 30-Oct-2001 jlemon

When dropping a packet because there is no room in the queue (which itself
is somewhat bogus), update the statistics to indicate something was dropped.

PR: 13740


# 84516 05-Oct-2001 ps

Make it so dummynet and bridge can be loaded as modules.

Submitted by: billf


# 84102 29-Sep-2001 jlemon

Add a hash table that contains the list of internet addresses, and use
this in place of the in_ifaddr list when appropriate. This improves
performance on hosts which have a large number of IP aliases.


# 84101 29-Sep-2001 jlemon

Centralize satosin(), sintosa() and ifatoia() macros in <netinet/in.h>
Remove local definitions.


# 84058 27-Sep-2001 luigi

Two main changes here:
+ implement "limit" rules, which permit to limit the number of sessions
between certain host pairs (according to masks). These are a special
type of stateful rules, which might be of interest in some cases.
See the ipfw manpage for details.

+ merge the list pointers and ipfw rule descriptors in the kernel, so
the code is smaller, faster and more readable. This patch basically
consists in replacing "foo->rule->bar" with "rule->bar" all over
the place.
I have been willing to do this for ages!

MFC after: 1 week


# 83934 25-Sep-2001 brooks

Make faith loadable, unloadable, and clonable.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 83130 06-Sep-2001 jlemon

Wrap array accesses in macros, which also happen to be lvalues:

ifnet_addrs[i - 1] -> ifaddr_byindex(i)
ifindex2ifnet[i] -> ifnet_byindex(i)

This is intended to ease the conversion to SMPng.


# 81111 03-Aug-2001 dcs

MFS: Avoid dropping fragments in the absence of an interface address.

Noticed by: fenner
Submitted by: iedowse
Not committed to current by: iedowse ;-)


# 80211 23-Jul-2001 ru

Avoid a NULL pointer derefence introduced in rev. 1.129.

Problem noticed by: bde, gcc(1)
Panic caught by: mjacob
Patch tested by: mjacob


# 79934 19-Jul-2001 ru

Backout non-functional changes from revision 1.128.

Not objected to by: dcs


# 79836 17-Jul-2001 dcs

(forced commit for more info on the previous change)

According to RFC 1112, which deals with multicasting, an application must
specify the interface on which it wished to send a multicast packet. In the
absence of such an interface, a default is selected.

Previous behavior in ip_output.c erroneously checked for a route to the
destination multicast address, rejecting the packet if none existed.

Applications got around this with a FAQ recommending a 224.0.0.0/29 route
to loopback. This worked because the multicast code in ip_output.c discards
route information if an interface was selected.

The previous commit skips the route check in the case where a multicast packet
is being send to an application-defined interface.

The only change in behavior, if no bugs were introduced, is the lack of
rejection of a multicast packet for which no route exists sent to an
application-defined interface.

MFC after: 2 weeks


# 79830 17-Jul-2001 dcs

Skip the route checking in the case of multicast packets with known
interfaces.

Reviewed by: people at that channel
Approved by: silence on -net


# 78093 11-Jun-2001 ume

This is force commit to mention about previous commit.

- use 0/8 to specify interface index on multicast get/setsockopt
- make sure to nuke m->m_aux pointer for ipsec, on if_output.
- pass error from ipsec_setsocket() all the way up.
- move ipsec output processing before filtering section.


# 78064 11-Jun-2001 ume

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 77574 01-Jun-2001 kris

Add ``options RANDOM_IP_ID'' which randomizes the ID field of IP packets.
This closes a minor information leak which allows a remote observer to
determine the rate at which the machine is generating packets, since the
default behaviour is to increment a counter for each packet sent.

Reviewed by: -net
Obtained from: OpenBSD


# 74213 13-Mar-2001 ru

RFC768 (UDP) requires that "if the computed checksum is zero, it
is transmitted as all ones". This got broken after introduction
of delayed checksums as follows. Some guys (including Jonathan)
think that it is allowed to transmit all ones in place of a zero
checksum for TCP the same way as for UDP. (The discussion still
takes place on -net.) Thus, the 0 -> 0xffff checksum fixup was
first moved from udp_output() (see udp_usrreq.c, 1.64 -> 1.65)
to in_cksum_skip() (see sys/i386/i386/in_cksum.c, 1.17 -> 1.18,
INVERT expression). Besides that I disagree that it is valid for
TCP, there was no real problem until in_cksum.c,v 1.20, where the
in_cksum() was made just a special version of in_cksum_skip().
The side effect was that now every incoming IP datagram failed to
pass the checksum test (in_cksum() returned 0xffff when it should
actually return zero). It was fixed next day in revision 1.21,
by removing the INVERT expression. The latter also broke the
0 -> 0xffff fixup for UDP checksums.

Before this change:
: tcpdump: listening on lo0
: 127.0.0.1.33005 > 127.0.0.1.33006: udp 0 (ttl 64, id 1)
: 4500 001c 0001 0000 4011 7cce 7f00 0001
: 7f00 0001 80ed 80ee 0008 0000

After this change:
: tcpdump: listening on lo0
: 127.0.0.1.33005 > 127.0.0.1.33006: udp 0 (ttl 64, id 1)
: 4500 001c 0001 0000 4011 7cce 7f00 0001
: 7f00 0001 80ed 80ee 0008 ffff


# 74111 11-Mar-2001 iedowse

In ip_output(), initialise `ia' in the case where the packet has
come from a dummynet pipe. Without this, the code which increments
the per-ifaddr stats can dereference an uninitialised pointer. This
should make dummynet usable again.

Reported by: "Dmitry A. Yanko" <fm@astral.ntu-kpi.kiev.ua>
Reviewed by: luigi, joe


# 73102 26-Feb-2001 asmodai

Remove conditionals for vax support.
People who care much about this are welcomed to try 2.11BSD. :)

Noticed by: luigi
Reviewed by: jesper


# 72012 04-Feb-2001 phk

Another round of the <sys/queue.h> FOREACH transmogriffer.

Created with: sed(1)
Reviewed by: md5(1)


# 71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 71909 01-Feb-2001 luigi

MFS: bridge/ipfw/dummynet fixes (bridge.c will be committed separately)


# 71613 25-Jan-2001 luigi

Pass up errors returned by dummynet. The same should be done with
divert.


# 70254 21-Dec-2000 bmilekic

* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.

* Fix a typo in a comment in mbuf.h

* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.


# 68150 31-Oct-2000 joe

It's no longer true that "nobody uses ia beyond here"; it's now
used to keep address based if_data statistics in.

Submitted by: ru


# 67893 29-Oct-2000 phk

Move suser() and suser_xxx() prototypes and a related #define from
<sys/proc.h> to <sys/systm.h>.

Correctly document the #includes needed in the manpage.

Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.


# 67833 29-Oct-2000 joe

Count per-address statistics for IP fragments.

Requested by: ru
Obtained from: BSD/OS


# 67375 20-Oct-2000 ru

Save a few CPU cycles in IP fragmentation code.


# 67334 19-Oct-2000 joe

Augment the 'ifaddr' structure with a 'struct if_data' to keep
statistics on a per network address basis.

Teach the IPv4 and IPv6 input/output routines to log packets/bytes
against the network address connected to the flow.

Teach netstat to display the per-address stats for IP protocols
when 'netstat -i' is evoked, instead of displaying the per-interface
stats.


# 65837 14-Sep-2000 ru

Follow BSD/OS and NetBSD, keep the ip_id field in network order all the time.

Requested by: wollman


# 65327 01-Sep-2000 ru

Fixed broken ICMP error generation, unified conversion of IP header
fields between host and network byte order. The details:

o icmp_error() now does not add IP header length. This fixes the problem
when icmp_error() is called from ip_forward(). In this case the ip_len
of the original IP datagram returned with ICMP error was wrong.

o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host
byte order, so DTRT and convert these fields back to network byte order
before sending a message. This fixes the problem described in PR 16240
and PR 20877 (ip_id field was returned in host byte order).

o ip_ttl decrement operation in ip_forward() was moved down to make sure
that it does not corrupt the copy of original IP datagram passed later
to icmp_error().

o A copy of original IP datagram in ip_forward() was made a read-write,
independent copy. This fixes the problem I first reported to Garrett
Wollman and Bill Fenner and later put in audit trail of PR 16240:
ip_output() (not always) converts fields of original datagram to network
byte order, but because copy (mcopy) and its original (m) most likely
share the same mbuf cluster, ip_output()'s manipulations on original
also corrupted the copy.

o ip_output() now expects all three fields, ip_len, ip_off and (what is
significant) ip_id in host byte order. It was a headache for years that
ip_id was handled differently. The only compatibility issue here is the
raw IP socket interface with IP_HDRINCL socket option set and a non-zero
ip_id field, but ip.4 manual page was unclear on whether in this case
ip_id field should be in host or network byte order.


# 64060 31-Jul-2000 darrenr

activate pfil_hooks and covert ipfilter to use it


# 62587 04-Jul-2000 itojun

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 61183 02-Jun-2000 jlemon

Add boundary checks against IP options.

Obtained from: OpenBSD


# 60910 25-May-2000 jlemon

Mark the checksum as complete when looping back multicast packets.

Submitted by: Jeff Gibbons <jgibbons@n2.net>


# 60889 24-May-2000 archie

Just need to pass the address family to if_simloop(), not the whole sockaddr.


# 60765 21-May-2000 jlemon

Compute the checksum before handing the packet off to IPFilter.

Tested by: Cy Schubert <Cy.Schubert@uumail.gov.bc.ca>


# 58936 02-Apr-2000 shin

Move htons() ip_len to after the in_delayed_cksum() call.
This should stop cksum error messages on IPsec communication
which was reported on freebsd-current.

Reviewed by: jlemon


# 58895 01-Apr-2000 jlemon

Calculate any delayed checksums before handing an mbuf off to a
divert socket. This fixes a problem with ppp/natd.

Reviewed by: bsd (Brian Dean, gotta love that login name)


# 58806 30-Mar-2000 jlemon

If `ipfw fwd' loops an mbuf back to ip_input from ip_output and the
mbuf is marked for delayed checksums, then additionally mark the
packet as having it's checksums computed. This allows us to bypass
computing/checking the checksum entirely, which isn't really needeed
as the packet has never hit the wire.

Reviewed by: green


# 58698 27-Mar-2000 jlemon

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


# 57855 09-Mar-2000 shin

Initialize mbuf pointer at getting ipsec policy.
Without this, kernel will panic at getsockopt() of IPSEC_POLICY.
Also make compilable libipsec/test-policy.c which tries getsockopt() of
IPSEC_POLICY.

Approved by: jkh

Submitted by: sakane@kame.net


# 57401 23-Feb-2000 guido

Remove option IPFILTER_KLD. In case you wanted to kldload ipfilter,
the module would only work in kernels built with this option.

Approved by: jkh


# 57114 10-Feb-2000 luigi

Support the net.inet.ip.fw.enable variable, part of
the recent ipfw modifications.

Approved-by: jordan


# 55777 10-Jan-2000 ru

MGETHDR() does not initialize m_pkthdr.rcvif, do it here.

This fixes page fault panic observed when diverting packets
with IP options (e.g. ping -R remoteIP over natd).

PR: kern/8596, kern/11199


# 55632 09-Jan-2000 shin

enable IPsec over DUMMYNET again

Submitted by: luigi
Reviewed by: luigi


# 55598 08-Jan-2000 luigi

Cleanup dummynet call interface so it should now work on the Alpha
as well. Also (probably) fix a bug introduced during the IPv6 import.


# 55009 22-Dec-1999 shin

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 54175 05-Dec-1999 archie

Miscellaneous fixes/cleanups relating to ipfw and divert(4):

- Implement 'ipfw tee' (finally)
- Divert packets by calling new function divert_packet() directly instead
of going through protosw[].
- Replace kludgey global variable 'ip_divert_port' with a function parameter
to divert_packet()
- Replace kludgey global variable 'frag_divert_port' with a function parameter
to ip_reass()
- style(9) fixes

Reviewed by: julian, green


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 46420 04-May-1999 luigi

Free the dummynet descriptor in ip_dummynet, not in the called
routines. The descriptor contains parameters which could be used
within those routines (eg. ip_output() ).

On passing, add IPPROTO_PGM entry to netinet/in.h


# 46393 04-May-1999 luigi

forgot passing the right pointer to dst to dummynet_io().
(-stable and releng2 were already safe).
Debugged-By: phk


# 45869 20-Apr-1999 peter

Tidy up some stray / unused stuff in the IPFW package and friends.
- unifdef -DCOMPAT_IPFW (this was on by default already)
- remove traces of in-kernel ip_nat package, it was never committed.
- Make IPFW and DUMMYNET initialize themselves rather than depend on
compiled-in hooks in ip_init(). This means they initialize the same
way both in-kernel and as kld modules. (IPFW initializes now :-)


# 44797 16-Mar-1999 luigi

Fix a dummynet bug caused by passing a bad next hop address (the
symptom was the msg "arp failure -- host is not on local network" that
some user have seen on multihomed machines.
Bug tracked down by Emmanuel Duros


# 44154 19-Feb-1999 luigi

avoid panic with pkts larger than MTU and DF set coming out of a pipe.


# 41990 21-Dec-1998 luigi

Restore 1.82->1.83 change deleted by mistake< per Bruce suggestion


# 41793 14-Dec-1998 luigi

Last bits (i think) of dummynet for -current.


# 41059 10-Nov-1998 peter

add #include <sys/kernel.h> where it's needed by MALLOC_DEFINE()


# 38754 02-Sep-1998 wollman

Properly fragment multicast packets.

PR: 7802
Submitted by: Steve McCanne <mccanne@cs.berkeley.edu>


# 38482 23-Aug-1998 wollman

Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.


# 37996 01-Aug-1998 peter

Fix a compile error if IPFIREWALL_FORWARD active without IPDIVERT.


# 37624 13-Jul-1998 bde

Fixed some longs that should have been fixed-sized types.


# 37413 06-Jul-1998 julian

Don't expect the new code to be used without the right option file being
included.


# 37412 06-Jul-1998 julian

Fix braino in switching to TAILQ macro.


# 37409 06-Jul-1998 julian

Support for IPFW based transparent forwarding.
Any packet that can be matched by a ipfw rule can be redirected
transparently to another port or machine. Redirection to another port
mostly makes sense with tcp, where a session can be set up
between a proxy and an unsuspecting client. Redirection to another machine
requires that the other machine also be expecting to receive the forwarded
packets, as their headers will not have been modified.

/sbin/ipfw must be recompiled!!!

Reviewed by: Peter Wemm <peter@freebsd.org>
Submitted by: Chrisy Luke <chrisy@flix.net>


# 37094 21-Jun-1998 bde

Removed unused includes.


# 36995 14-Jun-1998 julian

fix another typo


# 36992 14-Jun-1998 julian

Try narrow down the culprit sending undefined packet types through the loopback


# 36908 12-Jun-1998 julian

Go through the loopback code with a broom..
Remove lots'o'hacks.
looutput is now static.

Other callers who want to use loopback to allow shortcutting
should call the special entrypoint for this, if_simloop(), which is
specifically designed for this purpose. Using looutput for this purpose
was problematic, particularly with bpf and trying to keep track
of whether one should be using the charateristics of the loopback interface
or the interface (e.g. if_ethersubr.c) that was requesting the loopback.
There was a whole class of errors due to this mis-use each of which had
hacks to cover them up.

Consists largly of hack removal :-)


# 36710 06-Jun-1998 julian

Make sure the default value of a dummy variable is 0
so that it doesn't do anything.


# 36708 06-Jun-1998 julian

Fix wrong data type for a pointer.


# 36707 06-Jun-1998 julian

clean up the changes made to ipfw over the last weeks
(should make the ipfw lkm work again)


# 36678 05-Jun-1998 julian

Reverse the default sense of the IPFW/DIVERT reinjection code
so that the new behaviour is now default.
Solves the "infinite loop in diversion" problem when more than one diversion
is active.
Man page changes follow.

The new code is in -stable as the NON default option.


# 36369 25-May-1998 julian

Add optional code to change the way that divert and ipfw work together.
Prior to this change, Accidental recursion protection was done by
the diverted daemon feeding back the divert port number it got
the packet on, as the port number on a sendto(). IPFW knew not to
redivert a packet to this port (again). Processing of the ruleset
started at the beginning again, skipping that divert port.

The new semantic (which is how we should have done it the first time)
is that the port number in the sendto() is the rule number AFTER which
processing should restart, and on a recvfrom(), the port number is the
rule number which caused the diversion. This is much more flexible,
and also more intuitive. If the user uses the same sockaddr received
when resending, processing resumes at the rule number following that
that caused the diversion. The user can however select to resume rule
processing at any rule. (0 is restart at the beginning)

To enable the new code use

option IPFW_DIVERT_RESTART

This should become the default as soon as people have looked at it a bit


# 34746 21-Mar-1998 peter

Make this compile.. There are some unpleasing hacks in here.
A major unifdef session is sorely tempting but would destroy any remaining
chance of tracking the original sources.


# 33678 20-Feb-1998 bde

Don't depend on "implicit int".


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 31017 07-Nov-1997 phk

Rename some local variables to avoid shadowing other local variables.

Found by: -Wshadow


# 30966 05-Nov-1997 joerg

Make IPDIVERT a supported option. Alas, in_var.h depends on it, i
hope i've found out all files that actually depend on this dependancy.
IMHO, it's not very good practice to change the size of internal
structs depending on kernel options.


# 30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 30309 11-Oct-1997 phk

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# 27845 02-Aug-1997 bde

Removed unused #includes.


# 26359 02-Jun-1997 julian

Submitted by: Whistle Communications (archie Cobbs)

these are quite extensive additions to the ipfw code.
they include a change to the API because the old method was
broken, but the user view is kept the same.

The new code allows a particular match to skip forward to a particular
line number, so that blocks of rules can be
used without checking all the intervening rules.
There are also many more ways of rejecting
connections especially TCP related, and
many many more ...

see the man page for a complete description.


# 25516 06-May-1997 fenner

Pull up the IP header in ip_mloopback(). This makes sure that the
operations on the header inside ip_mloopback() are performed on
a private copy instead of a shared cluster.

PR: kern/3410


# 25201 27-Apr-1997 wollman

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 24590 03-Apr-1997 darrenr

Resolve conflicts created by import.


# 24570 03-Apr-1997 dg

Reorganize elements of the inpcb struct to take better advantage of
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).

Discussed-with: wollman


# 23221 28-Feb-1997 fenner

Fix a comment and some commented-out code in ip_mloopback to
reflect how multicast loopback really works.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22927 19-Feb-1997 darrenr

change IP Filter hooks to match new 3.1.8 patches for FreeBSD


# 22531 10-Feb-1997 darrenr

Add IP Filter hooks (from patches).


# 22212 02-Feb-1997 brian

Reset ip_divert_ignore to zero immediately after use - also,
set it in the first place, independent of whether sin->sin_port
is set.

The result is that diverted packets that are being forwarded
will be diverted once and only once on the way in (ip_input())
and again, once and only once on the way out (ip_output()) -
twice in total. ICMP packets that don't contain a port will
now also be diverted.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 20407 13-Dec-1996 wollman

Convert the interface address and IP interface address structures
to TAILQs. Fix places which referenced these for no good reason
that I can see (the references remain, but were fixed to compile
again; they are still questionable).


# 19622 11-Nov-1996 fenner

Add the IP_RECVIF socket option, which supplies a packet's incoming interface
using a sockaddr_dl.

Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR)
to work for multicast UDP and raw sockets as well. (They previously only
worked for unicast UDP).


# 19113 22-Oct-1996 sos

Changed args to the nat functions.


# 18797 07-Oct-1996 wollman

All three files: make COMPAT_IPFW==0 case work again.
ip_input.c:
- delete some dusty code
- _IP_VHL
- use fast inline header checksum when possible


# 17758 21-Aug-1996 sos

Add hooks for an IP NAT module, much like the firewall stuff...
Move the sockopt definitions for the firewall code from
ip_fw.h to in.h where it belongs.


# 17072 10-Jul-1996 julian

Adding changes to ipfw and the kernel to support ip packet diversion..
This stuff should not be too destructive if the IPDIVERT is not compiled in..
be aware that this changes the size of the ip_fw struct
so ipfw needs to be recompiled to use it.. more changes coming to clean this up.


# 16206 08-Jun-1996 bde

Changed some memcpy()'s back to bcopy()'s.

gcc only inlines memcpy()'s whose count is constant and didn't inline
these. I want memcpy() in the kernel go away so that it's obvious that
it doesn't need to be optimized. Now it is only used for one struct
copy in si.c.


# 15869 22-May-1996 wollman

Conditionalize calls to IPFW code on COMPAT_IPFW. This is done slightly
unconventionally:
If COMPAT_IPFW is not defined, or if it is defined to 1, enable;
otherwise, disable.

This means that these changes actually have no effect on anyone at the
moment. (It just makes it easier for me to keep my code in sync.)
In the future, the `not defined' part of the hack should be eliminated,
but doing this now would require everyone to change their config files.

The same conditionals need to be made in ip_input.c as well for this to
ave any useful effect, but I'm not ready to do that right now.


# 15850 21-May-1996 peter

Fix an embarresing error on my part that made the IP_PORTRANGE options
return a failure code (even though it worked).
This commit brought to you by the 'C' keyword "break".. :-)


# 15652 06-May-1996 wollman

Add three new route flags to help determine what sort of address
the destination represents. For IP:

- Iff it is a host route, RTF_LOCAL and RTF_BROADCAST indicate local
(belongs to this host) and broadcast addresses, respectively.

- For all routes, RTF_MULTICAST is set if the destination is multicast.

The RTF_BROADCAST flag is used by ip_output() to eliminate a call to
in_broadcast() in a common case; this gives about 1% in our packet-generation
experiments. All three flags might be used (although they aren't now)
to determine whether a packet can be forwarded; a given host route can
represent a forwardable address if:

(rt->rt_flags & (RTF_HOST | RTF_LOCAL | RTF_BROADCAST | RTF_MULTICAST))
== RTF_HOST

Obviously, one still has to do all the work if a host route is not present,
but this code allows one to cache the results of such a lookup if rtalloc1()
is called without masking RTF_PRCLONING.


# 15335 21-Apr-1996 bde

Fixed in-line IP header checksumming. It was performed on the wrong header
in one case.


# 15295 18-Apr-1996 wollman

Three speed-ups in the output path (two small, one substantial):

1) Require all callers to pass a valid route pointer to ip_output()
so that we don't have to check and allocate one off the stack
as was done before. This eliminates one test and some stack
bloat from the common (UDP and TCP) case.

2) Perform the IP header checksum in-line if it's of the usual length.
This results in about a 5% speed-up in my packet-generation test.

3) Use ip_vhl field rather than ip_v and ip_hl bitfields.


# 15026 03-Apr-1996 phk

Add feature for tcp "established".
Change interface between netinet and ip_fw to be more general, and thus
hopefully also support other ip filtering implementations.


# 14823 26-Mar-1996 fenner

Add missing splx(s) in IP_MULTICAST_IF

Submitted by: Jim Binkley <jrb@cs.pdx.edu>


# 14611 13-Mar-1996 pst

Fix ip option processing for raw IP sockets. This whole thing is a compromise
between ignoring options specified in the setsockopt call if IP_HDRINCL is set
(the UCB choice when VJ's code was brought in) vs allowing them (what everyone
else did, and what is assumed by programs everywhere...sigh).

Also perform some checking of the passed down packet to avoid running off
the end of a mbuf chain.

Reviewed by: fenner


# 14546 11-Mar-1996 dg

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# 14230 23-Feb-1996 phk

The new firewall functionality:
Filter on the direction (in/out).
Filter on fragment/not fragment.


# 14209 23-Feb-1996 phk

Big sweep over the IPFIREWALL and IPACCT code.

Close the ip-fragment hole.
Waste less memory.
Rewrite to contemporary more readable style.
Kill separate IPACCT facility, use "accept" rules in IPFIREWALL.
Filter incoming >and< outgoing packets.
Replace "policy" by sticky "deny all" rule.
Rules have numbers used for ordering and deletion.
Remove "rerorder" code entirely.
Count packet & bytecount matches for rules.

Code in -current & -stable is now the same.


# 14195 22-Feb-1996 peter

Make the default behavior of local port assignment match traditional
systems (my last change did not mix well with some firewall
configurations). As much as I dislike firewalls, this is one thing I
I was not prepared to break by default.. :-)

Allow the user to nominate one of three ranges of port numbers as
candidates for selecting a local address to replace a zero port number.
The ranges are selected via a setsockopt(s, IPPROTO_IP, IP_PORTRANGE, &arg)
call. The three ranges are: default, high (to bypass firewalls) and
low (to get a port below 1024).

The default and high port ranges are sysctl settable under sysctl
net.inet.ip.portrange.*

This code also fixes a potential deadlock if the system accidently ran out
of local port addresses. It'd drop into an infinite while loop.

The secure port selection (for root) should reduce overheads and increase
reliability of rlogin/rlogind/rsh/rshd if they are modified to take
advantage of it.

Partly suggested by: pst
Reviewed by: wollman


# 12934 19-Dec-1995 wollman

Added a comment about why trying to make a one-behind cache for
the route in ip_output() is a bad idea.


# 12635 05-Dec-1995 wollman

Path MTU Discovery is now standard.


# 12296 14-Nov-1995 phk

New style sysctl & staticize alot of stuff.


# 11537 16-Oct-1995 wollman

The ability to administratively change the MTU of an interface presents
a few new wrinkles for MTU discovery which tcp_output() had better
be prepared to handle. ip_output() is also modified to do something
helpful in this case, since it has already calculated the information
we need.


# 9728 26-Jul-1995 wollman

Fix test for determining when RSVP is inactive in a router. (In this
case, multicast options are not passed to ip_mforward().) The previous
version had a wrong test, thus causing RSVP mrouters to forward RSVP messages
in violation of the spec.


# 9386 02-Jul-1995 joerg

Slightly modify my previous change to return EINVAL instead of
EFAULT.

Submitted by: Peter Wemm


# 9383 01-Jul-1995 joerg

I saw a very low-key commit message on the netbsd mailing lists and
figured out what the problem was.. Anyway, I rate it as "highly
serious".

Submitted by: peter@haywire.DIALix.COM (Peter Wemm)


# 9209 13-Jun-1995 wollman

Kernel side of 3.5 multicast routing code, based on work by Bill Fenner
and other work done here. The LKM support is probably broken, but it
still compiles and will be fixed later.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8384 09-May-1995 dg

Replaced some bcopy()'s with memcpy()'s so that gcc while inline/optimize.


# 8090 26-Apr-1995 pst

Cleanup loopback interface support.
Reviewed by: wollman


# 7684 08-Apr-1995 dg

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# 7191 20-Mar-1995 wollman

This should be splimp() rather than splnet() since ifaddrs might go away
as a result of link-layer processing.


# 7190 20-Mar-1995 wollman

Fix race conditions involved in setting IP multicast options. This should
fix Dennis Fortin's problem for good, if I've got it figured out right.

(The problem was that a `struct ifaddr' could get deleted out from under
the current requester, thus leaving him with an invalid interface pointer
and causing even more bogus accesses.)


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 5543 12-Jan-1995 ugen

Actual firewall change.
1) Firewall is not subdivided on forwarding / blocking chains
anymore.Actually only one chain left-it was the blocking one.
2) LKM support.ip_fwdef.c is function pointers definition and
goes into kernel along with all INET stuff.


# 5105 13-Dec-1994 wollman

Call rtalloc_ign() so that protocol cloning will not occur at the IP layer.


# 5085 12-Dec-1994 ugen

Add match by interface from which packet arrived (via)
Handle right fragmented packets. Remove checking option
from kernel..


# 4523 16-Nov-1994 jkh

Ugen J.S.Antsilevich's latest, happiest, IP firewall code.
Poul: Please take this into BETA. It's non-intrusive, and a rather
substantial improvement over what was there before.


# 2754 14-Sep-1994 wollman

Shuffle some functions and variables around to make it possible for
multicast routing to be implemented as an LKM. (There's still a bit of
work to do in this area.)


# 2628 09-Sep-1994 wollman

Disable IPMULTICAST_VIF socket option when MROUTING is not defined,
since it doesn'tmake any sense for non-routers.
CVS:


# 2531 06-Sep-1994 wollman

Initial get-the-easy-case-working upgrade of the multicast code
to something more recent than the ancient 1.2 release contained in
4.4. This code has the following advantages as compared to
previous versions (culled from the README file for the SunOS release):

- True multicast delivery
- Configurable rate-limiting of forwarded multicast traffic on each
physical interface or tunnel, using a token-bucket limiter.
- Simplistic classification of packets for prioritized dropping.
- Administrative scoping of multicast address ranges.
- Faster detection of hosts leaving groups.
- Support for multicast traceroute (code not yet available).
- Support for RSVP, the Resource Reservation Protocol.

What still needs to be done:

- The multicast forwarder needs testing.
- The multicast routing daemon needs to be ported.
- Network interface drivers need to have the `#ifdef MULTICAST' goop ripped
out of them.
- The IGMP code should probably be bogon-tested.

Some notes about the porting process:

In some cases, the Berkeley people decided to incorporate functionality from
later releases of the multicast code, but then had to do things differently.
As a result, if you look at Deering's patches, and then look at
our code, it is not always obvious whether the patch even applies. Let
the reader beware.

I ran ip_mroute.c through several passes of `unifdef' to get rid of
useless grot, and to permanently enable the RSVP support, which we will
include as standard.

Ported by: Garrett Wollman
Submitted by: Steve Deering and Ajit Thyagarajan (among others)


# 2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 1817 02-Aug-1994 dg

Added $Id$


# 1813 01-Aug-1994 dg

fixed bug where large amounts of unidirectional UDP traffic would fill
the interface output queue and further udp packets would be fragmented
and only partially sent - keeping the output queue full and jamming the
network, but not actually getting any real work done (because you can't
send just 'part' of a udp packet - if you fragment it, you must send
the whole thing). The fix involves adding a check to make sure that the
output queue has sufficient space for all of the fragments.


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources