History log of /freebsd-10.3-release/sys/netinet/udp_usrreq.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 296373 04-Mar-2016 marius

- Copy stable/10@296371 to releng/10.3 in preparation for 10.3-RC1
builds.
- Update newvers.sh to reflect RC1.
- Update __FreeBSD_version to reflect 10.3.
- Update default pkg(8) configuration to use the quarterly branch.

Approved by: re (implicit)

# 277789 27-Jan-2015 bryanv

MFC r272886:

Add context pointer and source address to the UDP tunnel callback

These are needed for the forthcoming vxlan implementation. The context
pointer means we do not have to use a spare pointer field in the inpcb,
and the source address is required to populate vxlan's forwarding table.


# 274266 08-Nov-2014 bryanv

MFC r272844:

Add missing UDP multicast receive dtrace probes


# 274264 08-Nov-2014 bryanv

MFC r272797:

Check for mbuf copy failure when there are multiple multicast sockets

This partitular case is the only path where the mbuf could be NULL.
udp_append() checked for a NULL mbuf only after invoking the tunneling
callback. Our only in tree tunneling callback - SCTP - assumed a non
NULL mbuf, and it is a bit odd to make the callbacks responsible for
checking this condition.

This also reduces the differences between the IPv4 and IPv6 code.


# 272661 06-Oct-2014 tuexen

MFC r272347:
The default for UDPLITE_RECV_CSCOV is zero. RFC 3828 recommend
that this means full checksum coverage for received packets.
If an application is willing to accept packets with partial
coverage, it is expected to use the socket option and provide
the minimum coverage it accepts.


# 272660 06-Oct-2014 tuexen

MFC r272326:
UDPLite requires a checksum. Therefore, discard a received packet if
the checksum is 0.


# 272645 06-Oct-2014 tuexen

MFC r272323:
If the checksum coverage field in the UDPLITE header is the length
of the complete UDPLITE packet, the packet has full checksum coverage.
So fix the condition.


# 272627 06-Oct-2014 tuexen

MFC r272263:

Checksum coverage values larger than 65535 for UDPLite are invalid.
Check for this when the user calls setsockopt using UDPLITE_{SEND,RECV}CSCOV.


# 265946 13-May-2014 kevlo

MFC r264212,r264213,r264248,r265776,r265811,r265909:

- Add support for UDP-Lite protocol (RFC 3828) to IPv4 and IPv6 stacks.
Tested with vlc and a test suite [1].
[1] http://www.erg.abdn.ac.uk/~gerrit/udp-lite/files/udplite_linux.tar.gz

Reviewed by: jhb, glebius, adrian

- Fix a logic bug which prevented the sending of UDP packet with 0 checksum.

- Disable TX checksum offload for UDP-Lite completely. It wasn't used for
partial checksum coverage, but even for full checksum coverage it doesn't
work.


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 254893 26-Aug-2013 markj

The second last argument of udp:::receive is supposed to contain the
connection state, not the IP header.

X-MFC with: r254889


# 254889 25-Aug-2013 markj

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


# 253571 23-Jul-2013 ae

Remove the large part of struct ipsecstat. Only few fields of this
structure is used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.

This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.

Reported by: Taku YAMAMOTO, Maciej Milewski


# 253084 09-Jul-2013 ae

Migrate structs arpstat, icmpstat, mrtstat, pimstat and udpstat to PCPU
counters.


# 252026 20-Jun-2013 ae

Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics
accounting.

MFC after: 2 weeks


# 244031 08-Dec-2012 pjd

More warnings for zones that depend on the kern.ipc.maxsockets limit.

Obtained from: WHEEL Systems


# 243882 05-Dec-2012 glebius

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 242463 02-Nov-2012 ae

Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by: andre


# 242079 25-Oct-2012 ae

Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks


# 241923 23-Oct-2012 glebius

Do not reduce ip_len by size of IP header in the ip_input()
before passing a packet to protocol input routines.
For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.

Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.


# 241913 22-Oct-2012 glebius

Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>


# 241480 12-Oct-2012 glebius

In ip_stripoptions():
- Remove unused argument and incorrect comment.
- Fixup ip_len after stripping.


# 236961 12-Jun-2012 tuexen

Add a cmsg of type IP_TOS for UDP/IPv4 sockets to specify the TOS byte.

MFC after: 3 days


# 235985 25-May-2012 bz

MFp4 bz_ipv6_fast:

Properly protect the inp read access when handling the control code.
In the past this was expensive but given the rlock it's not so much
anymore.

Spotted while: optimizing udp6
Discussed with: rwatson (a few months ago)

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# 233554 27-Mar-2012 bz

Export the udp_cksum sysctl for upcoming SCTP work. Rather than always,
SCTP will only do IPv4 UDP checksum calculation as defined by the host
policy. When tunneling SCTP always calculates the inner checksum already
so not doing the outer UDP can save cycles.

While here virtualize the variable.

Requested by: tuexen
MFC after: 2 weeks


# 225044 20-Aug-2011 bz

Add support for IPv6 to ipfw fwd:
Distinguish IPv4 and IPv6 addresses and optional port numbers in
user space to set the option for the correct protocol family.
Add support in the kernel for carrying the new IPv6 destination
address and port.
Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change
the address in the IP header.
Add support for IPv6 forwarding to a non-local destination.
Add a regession test uitilizing VIMAGE to check all 20 possible
combinations I could think of.

Obtained from: David Dolson at Sandvine Incorporated
(original version for ipfw fwd IPv6 support)
Sponsored by: Sandvine Incorporated
PR: bin/117214
MFC after: 4 weeks
Approved by: re (kib)


# 222748 06-Jun-2011 rwatson

Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.

Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).

Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.

Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.

Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).

Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.

Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


# 222691 04-Jun-2011 rwatson

Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).

Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.

(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


# 222488 30-May-2011 rwatson

Decompose the current single inpcbinfo lock into two locks:

- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


# 221248 30-Apr-2011 bz

Make the UDP code compile without INET. Expose udp_usrreq.c to IPv6 only
as well compiling out most functions adding or extending #ifdef INET
coverage.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


# 220620 14-Apr-2011 trasz

Refactor udp_input(), moving calls to u_tun_func() into udp_append().

Obtained from: Wheel Systems Sp. z o.o.
Reviewed by: bz@


# 218629 13-Feb-2011 deischen

Oops, revert an accidental local change that got added in
my last commit (r218627). No damage was done in the last
commit, just some duplicated code was added (which is now
removed).


# 218627 13-Feb-2011 deischen

Allow the SO_SETFIB socket option to select the default (0)
routing table.

Reviewed by: julian


# 217597 19-Jan-2011 rrs

Fix style 9 nit that snuck in when I
grabbed the wrong patch ;-0 (thanks Daniel)

MFC after: 1 week


# 217592 19-Jan-2011 rrs

Fix a bug where Multicast packets sent from a
udp endpoint may end up echoing back to the sender
even with OUT joining the multi-cast group.

Reviewed by: gnn, bms, bz?
Obtained from: deischen (with help from)


# 217554 18-Jan-2011 mdf

Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need
to rely on the format string. For SYSCTL_PROC instances that I
noticed a discrepancy between the CTLTYPE and the format specifier,
fix the CTLTYPE.


# 215701 22-Nov-2010 dim

After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.


# 215317 14-Nov-2010 dim

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.


# 211433 17-Aug-2010 jhb

Ensure a minimum "slop" of 10 extra pcb structures when providing a
memory size estimate to userland for pcb list sysctls. The previous
behavior of a "slop" of n/8 does not work well for small values of n
(e.g. no slop at all if you have less than 8 open UDP connections).

Reviewed by: bz
MFC after: 1 week


# 207369 29-Apr-2010 bz

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 205251 17-Mar-2010 bz

Add pcb reference counting to the pcblist sysctl handler functions
to ensure type stability while caching the pcb pointers for the
copyout.

Reviewed by: rwatson
MFC after: 7 days


# 205157 14-Mar-2010 rwatson

Abstract out initialization of most aspects of struct inpcbinfo from
their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and
create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy()
to do this work in a central spot. As inpcbinfo becomes more complex
due to ongoing work to add connection groups, this will reduce code
duplication.

MFC after: 1 month
Reviewed by: bz
Sponsored by: Juniper Networks


# 204826 07-Mar-2010 rwatson

Make udp_set_kernel_tunneling() less forgiving when its invariants are
violated: so_pcb can never be NULL for a valid UDP socket, and it is
always SOCK_DGRAM. Use sotoinpcb() as the rest of the UDP code does.

MFC after: 1 week
Reviewed by: bz
Sponsored by: Juniper Networks


# 204807 06-Mar-2010 bz

Destroy UDP UMA zones (empty or not) upon network stack teardown
to not leak them making the VM subsystem unhappy with every stoped vnet(*).
We will still leak pages (especially as zones are marked NOFREE).

(*) This will also keep vmstat -z more usable.

Sponsored by: ISPsystem
MFC after: 5 days


# 196039 02-Aug-2009 rwatson

Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem. These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored. This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

if_bridge
if_cxgb
if_gif
ip_mroute
ipdivert
pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack. However, that change is too agressive for
this point in the release cycle.

Reviewed by: bz
Approved by: re (kib)


# 196019 01-Aug-2009 rwatson

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 195727 16-Jul-2009 rwatson

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# 195699 14-Jul-2009 rwatson

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 194062 12-Jun-2009 vanhu

Added support for NAT-Traversal (RFC 3948) in IPsec stack.

Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...

X-MFC: never

Reviewed by: bz
Approved by: gnn(mentor)
Obtained from: NETASQ


# 193731 08-Jun-2009 zec

Introduce an infrastructure for dismantling vnet instances.

Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)


# 193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 192895 27-May-2009 jamie

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 192649 23-May-2009 bz

Implement UDP control block support.

So far the udp_tun_func_t had been (ab)using inp_ppcb for udp in kernel
tunneling callbacks. Move that into the udpcb and add a field for flags
there to be used by upcoming changes instead of sticking udp only flags
into in_pcb flags2.

Bump __FreeBSD_version for ports to detect it and because of vnet* struct
size changes.

Submitted by: jhb (7.x version)
Reviewed by: rwatson


# 191688 30-Apr-2009 zec

Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)


# 191548 26-Apr-2009 zec

In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by: bz (an older version of the patch)
Approved by: julian (mentor)


# 190962 12-Apr-2009 rwatson

Update stats in struct udpstat using two new macros, UDPSTAT_ADD()
and UDPSTAT_INC(), rather than directly manipulating the fields
across the kernel. This will make it easier to change the
implementation of these statistics, such as using per-CPU versions
of the data structures.

MFC after: 3 days


# 190951 11-Apr-2009 rwatson

Update stats in struct ipstat using four new macros, IPSTAT_ADD(),
IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly
manipulating the fields across the kernel. This will make it easier
to change the implementation of these statistics, such as using
per-CPU versions of the data structures.

MFC after: 3 days


# 189592 09-Mar-2009 bms

Merge IGMPv3 and Source-Specific Multicast (SSM) to the FreeBSD
IPv4 stack.

Diffs are minimized against p4.
PCS has been used for some protocol verification, more widespread
testing of recorded sources in Group-and-Source queries is needed.
sizeof(struct igmpstat) has changed.

__FreeBSD_version is bumped to 800070.


# 188144 05-Feb-2009 jamie

Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise. The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family. For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes). Remove the explicit
jailed() checks that preceded many of the function calls.

Approved by: bz (mentor)


# 186821 06-Jan-2009 rrs

Addresses Roberts comments on comments. Also adds
the KASSERT and checks suggested.

Reviewed by: The udp tunneling was discussed on net@ under the
thread entitled "Heads up -- Thinking about UDP and tunneling"


# 186813 06-Jan-2009 rrs

Add the ability of an alternate transport protocol
to easily tunnel over udp by providing a hook
function that will be called instead of appending
to the socket buffer.


# 185895 10-Dec-2008 zec

Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 185571 02-Dec-2008 bz

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 185435 29-Nov-2008 bz

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# 185348 26-Nov-2008 zec

Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 185088 19-Nov-2008 zec

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 183982 17-Oct-2008 bz

Add cr_canseeinpcb() doing checks using the cached socket
credentials from inp_cred which is also available after the
socket is gone.
Switch cr_canseesocket consumers to cr_canseeinpcb.
This removes an extra acquisition of the socket lock.

Reviewed by: rwatson
MFC after: 3 months (set timer; decide then)


# 183606 04-Oct-2008 bz

Cache so_cred as inp_cred in the inpcb.
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.

Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks
X-MFC Note: use a inp_pspare for inp_cred


# 183550 02-Oct-2008 zec

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 182148 25-Aug-2008 julian

Another missed V_ instance


# 181803 17-Aug-2008 bz

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 181782 16-Aug-2008 bz

Fix a regression introduced in r179289 splitting up ip6_savecontrol()
into v4-only vs. v6-only inp_flags processing.
When ip6_savecontrol_v4() is called from ip6_savecontrol() we
were not passing back the **mp thus the information will be missing
in userland.
Istead of going with a *** as suggested in the PR we are returning
**mp now and passing in the v4only flag as a pointer argument.

PR: kern/126349
Reviewed by: rwatson, dwmalone


# 180836 26-Jul-2008 mav

Increase UDBHASHSIZE from 16 to 128 items.
Previous value was chosen 10 years ago and not very effective now.
This change gives several percents speedup on 1000 L2TP mpd links.


# 180631 20-Jul-2008 trhodes

Document a few sysctls.

Reviewed by: rwatson


# 180558 16-Jul-2008 rwatson

Fix error in comment.

MFC after: 3 weeks


# 180536 15-Jul-2008 rwatson

Merge last of a series of rwlock conversion changes to UDP, which
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:

- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an incpcb read lock in udp_output(), which stablizes the
local inpcb address and port bindings in order to determine what further
locking is required:
- If the inpcb is currently not bound (at all) and are implicitly
connecting, we require inpcbinfo and inpcb write locks, so drop the
read lock and re-acquire.
- If the inpcb is bound for at least one of the port or address, but an
explicit source or destination is requested, trylock the inpcbinfo
lock, and if that fails, drop the inpcb lock, lock the global lock,
and relock the inpcb lock.
- Otherwise, no further locking is required (common case).
- Update comments.

In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack. This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.

The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a weeks in the tree.

Tested by: ps, kris (older revision), bde
MFC after: 3 weeks


# 180429 10-Jul-2008 rwatson

Slightly rearrange validation of UDP arguments and jail processing in
udp_output() so that argument validation occurs before jail processing.

Add additional comments explaining what's going on when we process
addresses and binding during udp_output().

MFC after: 3 weeks


# 180422 10-Jul-2008 rwatson

Apply the MAC label to an outgoing UDP packet when other inpcb properties are
processed, meaning that we avoid the cost of MAC label assignment if we're
going to drop the packet due to mbuf exhaustion, etc.

MFC after: 3 weeks


# 180348 07-Jul-2008 rwatson

Allow udp_notify() to accept read, as well as write, locks on the passed
inpcb. When directly invoking udp_notify() from udp_ctlinput(), acquire
only a read lock; we may still see write locks in udp_notify() as the
in_pcbnotifyall() routine is shared with TCP and always uses a write lock
on the inpcb being notified.

MFC after: 1 month


# 180346 07-Jul-2008 rwatson

Add additional udbinfo and inpcb locking assertions to udp_output(); for
some code paths, global or inpcb write locks are required, but for other
code paths, read locks or no locking at all are sufficient for the data
structures.

MFC after: 1 month


# 180344 07-Jul-2008 rwatson

First step towards parallel transmit in UDP: if neither a specific
source or a specific destination address is requested as part of a send
on a UDP socket, read lock the inpcb rather than write lock it. This
will allow fully parallel transmit down to the IP layer when sending
simultaneously from multiple threads on a connected UDP socket.

Parallel transmit for more complex cases, such as when sendto(2) is
invoked with an address and there's already a local binding, will
follow.

MFC after: 1 month


# 180338 07-Jul-2008 rwatson

Drop read lock on udbinfo earlier during delivery to the last matching
UDP socket for a datagram; the inpcb read lock is sufficient to provide
inpcb stability during udp_append().

MFC after: 1 month


# 180198 02-Jul-2008 rwatson

Add soreceive_dgram(9), an optimized socket receive function for use by
datagram-only protocols, such as UDP. This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.

This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.

Tested by: kris, ps


# 180127 30-Jun-2008 rwatson

In udp_append() and udp_input(), make use of read locking on incpbs
rather than write locking: while we need to maintain a valid reference
to the inpcb and fix its state, no protocol layer state is modified
during an IPv4 UDP receive -- there are only changes at the socket
layer, which is separately protected by socket locking.

While parallel concurrent receive on a single UDP socket is currently
relatively unusual, introducing read locking in the transmit path,
allowing concurrent receive and transmit, will significantly improve
performance for loads such as BIND, memcached, etc.

MFC after: 2 months
Tested by: gnn, kris, ps


# 179412 29-May-2008 rwatson

Employ read locks on UDP inpcbs, rather than write locks, when
monitoring UDP connections using sysctls. In some cases, add
previously missing locking of inpcbs, as inp_socket is followed,
which also allows us to drop global locks more quickly.

MFC after: 1 week


# 179289 24-May-2008 bz

Factor out the v4-only vs. the v6-only inp_flags processing in
ip6_savecontrol in preparation for udp_append() to no longer
need an WLOCK as we will no longer be modifying socket options.

Requested by: rwatson
Reviewed by: gnn
MFC after: 10 days


# 178285 17-Apr-2008 rwatson

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# 172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 172467 07-Oct-2007 silby

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# 172114 10-Sep-2007 rwatson

Further UDPv4 cleanup:

- Resort includes a bit.
- Correct typos and wording problems in comments.
- Rename udpcksum to udp_cksum to be consistent with other UDP-related
configuration variables.
- Remove indirection of udp_notify through local notify variable in
udp_ctlinput(), which is presumably due to copying and pasting from TCP,
where multiple notify routines exist.

Approved by: re (kensmith)


# 171339 10-Jul-2007 rwatson

Further cleanup of UDPv4:

- Move udp_sendspace and udp_recvspace global variables and associated
sysctls to the top of the file where most other such things are present.

- Rename static variable 'blackhole' to 'udp_blackhole' and unstaticize
so that we can add blackhole support for UDPv6 using the same MIB
variable.

- Move udp_append() above udp_input() to match the function order in
udp6_usrreq.c.

Approved by: re (kensmith)


# 171290 07-Jul-2007 rwatson

Minor UDPv4 cleanup: capitalize comment, move statistics update after mbuf
free to be consistent with other error handling, and release socket buffer
lock before freeing mbufs and statistics updates rather than after.

Approved by: re (kensmith)


# 171167 03-Jul-2007 gnn

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


# 171133 01-Jul-2007 gnn

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


# 170861 17-Jun-2007 mjacob

Make gcc4.2 happy and zero save_ip for the unlikely (blackhole != 0)
codepath.


# 170613 12-Jun-2007 bms

Import rewrite of IPv4 socket multicast layer to support source-specific
and protocol-independent host mode multicast. The code is written to
accomodate IPv6, IGMPv3 and MLDv2 with only a little additional work.

This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.

The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html

Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazy allocated as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.

This work was financially supported by another FreeBSD committer.

Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)


# 170587 12-Jun-2007 rwatson

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


# 169598 16-May-2007 dwmalone

When verifying the IPv4 UDP checksum, don't overwrite the checksum
value in the mbuf with the result of the calculation. Previously,
if we chose to return an ICMP message, the quoted UDP checksum bytes
would be different to what was sent.

PR: 112471
Submitted by: Matthew Luckie <mluckie@cs.waikato.ac.nz>
MFC after: 3 weeks


# 169462 11-May-2007 rwatson

Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr
protocol entry points using functions named proto_getsockaddr and
proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr.
While it's true that sockaddrs are allocated and set, the net effect is
to retrieve (get) the socket address or peer address from a socket, not
set it, so align names to that intent.


# 169349 07-May-2007 rwatson

Since udp_peeraddr() and udp_sockaddr() directly wrap in_setpeeraddr()
and in_setsockaddr(), containing only stale comments on why they
exist, remove them and initialize the protosw for UDP to directly
reference in_setpeeraddr() and in_setsockaddr().


# 169348 07-May-2007 rwatson

Minor style tweaks.


# 169179 01-May-2007 rwatson

Remove unused pcbinfo arguments to in_setsockaddr() and
in_setpeeraddr().


# 169154 30-Apr-2007 rwatson

Rename some fields of struct inpcbinfo to have the ipi_ prefix,
consistent with the naming of other structure field members, and
reducing improper grep matches. Clean up and comment structure
fields in structure definition.


# 167342 08-Mar-2007 bms

Fix IP_SENDSRCADDR semantics.

* To use this option with a UDP socket, it must be bound to a local port,
and INADDR_ANY, to disallow possible collisions with existing udp inpcbs
bound to the same port on other interfaces at send time.

* If the socket is bound to INADDR_ANY, specifying IP_SENDSRCADDR with
INADDR_ANY will be rejected as it is ambiguous.

* If the socket is bound to an address other than INADDR_ANY, specifying
IP_SENDSRCADDR with INADDR_ANY will be disallowed by in_pcbbind_setup().

Reviewed by: silence on -net
Tested with: src/tools/regression/netinet/ipbroadcast
MFC after: 4 days


# 166842 20-Feb-2007 rwatson

Rename two identically named log_in_vain variables: tcp_input.c's static
log_in_vain to tcp_log_in_vain, and udp_usrreq's global log_in_vain to
udp_log_in_vain.

MFC after: 1 week


# 166841 20-Feb-2007 rwatson

Gratuitous UDP restyling toward style(9) in 7.x.


# 165831 06-Jan-2007 maxim

o One more typo in the comment.

PR: kern/107609
Submitted by: Dr. Markus Waldeck


# 165710 01-Jan-2007 imp

Fix typo in comment.

Submitted by: remko


# 165709 31-Dec-2006 imp

Add comment about udp checksums being off in BSD 4.2 compatibility mode.

Submitted by: Dr. Markus Waldeck
PR: kern/106657


# 165634 29-Dec-2006 jhb

Some whitespace nits and remove a few casts.


# 164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 162071 06-Sep-2006 andre

Check inp_flags instead of inp_vflag for INP_ONESBCAST flag.

PR: kern/99558
Tested by: Andrey V. Elsukov <bu7cher-at-yandex.ru>
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 161974 04-Sep-2006 thomas

Fix typo in comment.


# 160549 21-Jul-2006 rwatson

Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket. pru_abort is now a
notification of close also, and no longer detaches. pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket. This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree(). With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by: gnn


# 160491 18-Jul-2006 ups

Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by: rwatson@, mohans@
MFC after: 1 week


# 159198 03-Jun-2006 rwatson

Acquire udbinfo lock after call to soreserve() rather than before, as it
is not required. This simplifies error-handling, and reduces the time
that this lock is held.

MFC after: 1 month


# 158800 21-May-2006 maxim

o In udp|rip_disconnect() acquire a socket lock before the socket
state modification. To prevent races do that while holding inpcb
lock.

Reviewed by: rwatson


# 158332 06-May-2006 rwatson

Modify UDP to use sosend_dgram() instead of sosend(). This allows
for signicantly optimized UDP socket I/O when using a single UDP
socket from many threads or processes that share it, by avoiding
significant locking and other overhead in the general sosend()
path that isn't necessary for simple datagram sockets. Specifically,
this change results in a significant performance improvement for
threaded name service in BIND9 under load.

Suggested by: Jinmei_Tatsuya at isc dot org


# 158021 25-Apr-2006 rwatson

Rename 'last' to 'inp' in udp_append(): the name 'last' is due to
the fact that the loop through inpcb's in udp_input() tracks the
last inpcb while looping. We keep that name in the calling loop
but not in the delivery routine itself.

MFC after: 3 months


# 157927 21-Apr-2006 ps

Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.


# 157374 01-Apr-2006 rwatson

Update in_pcb-derived basic socket types following changes to
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, in protocol
shutdown methods, and in raw IP send.

- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.

- Invoke in_pcbfree() after in_pcbdetach() in order to free the
detached in_pcb structure for a socket.

MFC after: 3 months


# 157370 01-Apr-2006 rwatson

Chance protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.

MFC after: 3 months


# 157366 01-Apr-2006 rwatson

Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful. Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit. This will be corrected shortly in followup
commits to these components.

MFC after: 3 months


# 154767 24-Jan-2006 glebius

Implement 'ipfw fwd laddr,port' feature for UDP. According to ipfw(8)
it should work, however it never did. People expect it to work.

PR: kern/90834


# 154349 14-Jan-2006 rwatson

Remove dead code: 'opts' is not used in udp_append(), only in udp_input(),
so no need to assign it to NULL or conditionally free it.

Found with: Coverity Prevent(tm)
MFC after: 3 days


# 153427 14-Dec-2005 mux

Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to
match the type of the variable they are exporting.

Spotted by: Thomas Hurst <tom@hur.st>
MFC after: 3 days


# 152592 18-Nov-2005 andre

Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
ip_dooptions(struct mbuf *m, int pass)
save_rte(m, option, dst)
ip_srcroute(m0)
ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
ip_insertoptions(m, opt, phlen)
ip_optcopy(ip, jp)
ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


# 151263 12-Oct-2005 maxim

o INP_ONESBCAST is inpcb.inp_vflag flag not inp_flags. The confusion
with IP_PORTRANGE_HIGH leads to the incorrect checksum calculation.

PR: kern/87306
Submitted by: Rickard Lind
Reviewed by: bms
MFC after: 2 weeks


# 150594 26-Sep-2005 andre

Implement IP_DONTFRAG IP socket option enabling the Don't Fragment
flag on IP packets. Currently this option is only repected on udp
and raw ip sockets. On tcp sockets the DF flag is controlled by the
path MTU discovery option.

Sending a packet larger than the MTU size of the egress interface
returns an EMSGSIZE error.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


# 149371 22-Aug-2005 andre

Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket. All packets with a
lower TTL are silently dropped. Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682. Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005


# 146854 01-Jun-2005 rwatson

De-spl UDP.

MFC after: 3 days


# 145953 06-May-2005 cperciva

If we are going to
1. Copy a NULL-terminated string into a fixed-length buffer, and
2. copyout that buffer to userland,
we really ought to
0. Zero the entire buffer
first.

Security: FreeBSD-SA-05:08.kmem


# 144260 29-Mar-2005 sam

eliminate extraneous null ptr checks

Noticed by: Coverity Prevent analysis tool


# 142207 22-Feb-2005 glebius

In in_pcbconnect_setup() jailed sockets are treated specially: if local
address is not supplied, then jail IP is choosed and in_pcbbind() is called.
Since udp_output() does not save local addr after call to in_pcbconnect_setup(),
in_pcbbind() is called for each packet, and this is incorrect.

So, we shall treat jailed sockets specially in udp_output(), we will save
their local address.

This fixes a long standing bug with broken sendto() system call in jails.

PR: kern/26506
Reviewed by: rwatson
MFC after: 2 weeks


# 139823 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 137386 08-Nov-2004 phk

Initialize struct pr_userreqs in new/sparse style and fill in common
default elements in net_init_domain().

This makes it possible to grep these structures and see any bogosities.


# 137183 04-Nov-2004 phk

Hide udp_in6 behind #ifdef INET6


# 137176 04-Nov-2004 rwatson

Until this change, the UDP input code used global variables udp_in,
udp_in6, and udp_ip6 to pass socket address state between udp_input(),
udp_append(), and soappendaddr_locked(). While file in the default
configuration, when running with multiple netisrs or direct ithread
dispatch, this can result in races wherein user processes using
recvmsg() get back the wrong source IP/port. To correct this and
related races:

- Eliminate udp_ip6, which is believed to be generated but then never
used. Eliminate ip_2_ip6_hdr() as it is now unneeded.

- Eliminate setting, testing, and existence of 'init' status fields
for the IPv6 structures. While with multiple UDP delivery this
could lead to amortization of IPv4 -> IPv6 conversion when
delivering an IPv4 UDP packet to an IPv6 socket, it added
substantial complexity and side effects.

- Move global structures into the stack, declaring udp_in in
udp_input(), and udp_in6 in udp_append() to be used if a conversion
is required. Pass &udp_in into udp_append().

- Re-annotate comments to reflect updates.

With this change, UDP appears to operate correctly in the presence of
substantial inbound processing parallelism. This solution avoids
introducing additional synchronization, but does increase the
potential stack depth.

Discovered by: kris (Bug Magnet)
MFC after: 3 weeks


# 136449 12-Oct-2004 rwatson

Don't release the udbinfo lock until after the last use of UDP inpcb
in udp_input(), since the udbinfo lock is used to prevent removal of
the inpcb while in use (i.e., as a form of reference count) in the
in-bound path.

RELENG_5 candidate.


# 134793 05-Sep-2004 jmg

fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...


# 134142 22-Aug-2004 rwatson

When sliding the m_data pointer forward, update m_pktrhdr.len as well
as m_len, or the pkthdr length will be inconsistent with the actual
length of data in the mbuf chain. The symptom of this occuring was
"out of data" warnings from in_cksum_skip() on large UDP packets sent
via the loopback interface.

Foot shot: green


# 134119 21-Aug-2004 rwatson

When prepending space onto outgoing UDP datagram payloads to hold the
UDP/IP header, make sure that space is also allocated for the link
layer header. If an mbuf must be allocated to hold the UDP/IP header
(very likely), then this will avoid an additional mbuf allocation at
the link layer. This trick is also used by TCP and other protocols to
avoid extra calls to the mbuf allocator in the ethernet (and related)
output routines.


# 133994 19-Aug-2004 rwatson

Push down pcbinfo and inpcb locking from udp_send() into udp_output().
This provides greater context for the locking and allows us to avoid
locking the pcbinfo structure if not binding operations will take
place (i.e., already bound, connected, and no expliti sendto()
address).


# 133874 16-Aug-2004 rwatson

White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>


# 133532 12-Aug-2004 rwatson

When udp_send() fails, make sure to free the control mbufs as well as
the data mbuf. This was done in most error cases, but not the case
where the inpcb pointer is surprisingly NULL.


# 133517 11-Aug-2004 andre

Backout removal of UMA_ZONE_NOFREE flag for all zones which are established
for structures with timers in them. It might be that a timer might fire
even when the associated structure has already been free'd. Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.

Discussed with: bosko, tegge, rwatson


# 133509 11-Aug-2004 andre

Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and
TCP code. This flag would have prevented giving back excessive free slabs
to the global pool after a transient peak usage.


# 133189 06-Aug-2004 rwatson

When iterating the UDP inpcb list processing an inbound broadcast
or multicast packet, we don't need to acquire the inpcb mutex
unless we are actually using inpcb fields other than the bound port
and address. Since we hold the pcbinfo lock already, these can't
change. Defer acquiring the inpcb mutex until we have a high
chance of a match. This avoids about 120 mutex operations per UDP
broadcast packet received on one of my work systems.

Reviewed by: sam


# 132653 26-Jul-2004 cperciva

Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb


# 131151 26-Jun-2004 rwatson

Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
acquire the socket buffer lock and use the _locked() variants of both
calls. Note that the _locked() sowakeup() versions unlock the mutex on
return. This is done in uipc_send(), divert_packet(), mroute
socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
explicit unlock and use the _locked() sowakeup() variant. This is done
in soisdisconnecting(), soisdisconnected() when setting the can't send/
receive flags and dropping data, and in uipc_rcvd() which adjusting
back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.


# 130583 16-Jun-2004 bms

Reverse a patch which has no effect on -CURRENT and should probably be
applied directly to -STABLE.

Noticed by: iedowse
Pointy hat to: bms


# 130559 16-Jun-2004 bms

Disconnect a temporarily-connected UDP socket in out-of-mbufs case. This
fixes the problem of UDP sockets getting wedged in a connected state (and
bound to their destination) under heavy load.
Temporary bind/connect should probably be deleted in future
as an optimization, as described in "A Faster UDP" [Partridge/Pink 1993].

Notes:
- INP_LOCK() is already held in udp_output(). The connection is in effect
happening at a layer lower than the socket layer, therefore in theory
socket locking should not be needed.
- Inlining the in_pcbdisconnect() operation buys us nothing (in the case
of the current state of the code), as laddr is not part of the
inpcb hash or the udbinfo hash. Therefore there should be no need
to rehash after restoring laddr in the error case (this was a
concern of the original author of the patch).

PR: kern/41765
Requested by: gnn
Submitted by: Jinmei Tatuya (with cleanups)
Tested by: spray(8)


# 128905 04-May-2004 rwatson

Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed. In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research


# 128904 04-May-2004 rwatson

Assert inpcb lock in udp_append().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research


# 128019 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 127505 27-Mar-2004 pjd

Reduce 'td' argument to 'cred' (struct ucred) argument in those functions:
- in_pcbbind(),
- in_pcbbind_setup(),
- in_pcbconnect(),
- in_pcbconnect_setup(),
- in6_pcbbind(),
- in6_pcbconnect(),
- in6_pcbsetport().
"It should simplify/clarify things a great deal." --rwatson

Requested by: rwatson
Reviewed by: rwatson, ume


# 127504 27-Mar-2004 pjd

Remove unused argument.

Reviewed by: ume


# 126253 26-Feb-2004 truckman

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# 125941 17-Feb-2004 ume

IPSEC and FAST_IPSEC have the same internal API now;
so merge these (IPSEC has an extra ipsecstat)

Submitted by: "Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>


# 125396 03-Feb-2004 ume

pass pcb rather than so. it is expected that per socket policy
works again.


# 125264 31-Jan-2004 phk

Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead. Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.


# 125118 27-Jan-2004 ru

Correct the descriptions of the net.inet.{udp,raw}.recvspace sysctls.


# 122991 26-Nov-2003 sam

Split the "inp" mutex class into separate classes for each of divert,
raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness
complaints.

Reviewed by: rwatson
Approved by: re (rwatson)


# 122922 20-Nov-2003 andre

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 122875 18-Nov-2003 rwatson

Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 122579 12-Nov-2003 bms

Add a new sysctl knob, net.inet.udp.strict_mcast_mship, to the udp_input path.

This switch toggles between strict multicast delivery, and traditional
multicast delivery.

The traditional (default) behaviour is to deliver multicast datagrams to all
sockets which are members of that group, regardless of the network interface
where the datagrams were received.

The strict behaviour is to deliver multicast datagrams received on a
particular interface only to sockets whose membership is bound to that
interface.

Note that as a matter of course, multicast consumers specifying INADDR_ANY
for their interface get joined on the interface where the default route
happens to be bound. This switch has no effect if the interface which the
consumer specifies for IP_ADD_MEMBERSHIP is not UP and RUNNING.

The original patch has been cleaned up somewhat from that submitted. It has
been tested on a multihomed machine with multiple QuickTime RTP streams
running over the local switch, which doesn't do IGMP snooping.

PR: kern/58359
Submitted by: William A. Carrel
Reviewed by: rwatson
MFC after: 1 week


# 122328 08-Nov-2003 sam

assert inpcb is locked in udp_output

Supported by: FreeBSD Foundation


# 121674 29-Oct-2003 ume

ip6_savecontrol() argument is redundant


# 119705 03-Sep-2003 bms

PR: kern/56343
Reviewed by: tjr
Approved by: jake (mentor)


# 119178 20-Aug-2003 bms

Add the IP_ONESBCAST option, to enable undirected IP broadcasts to be sent on
specific interfaces. This is required by aodvd, and may in future help us
in getting rid of the requirement for BPF from our import of isc-dhcp.

Suggested by: fenestro
Obtained from: BSD/OS
Reviewed by: mini, sam
Approved by: jake (mentor)


# 119132 19-Aug-2003 sam

add missing unlock when in_pcballoc returns an error


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 110896 15-Feb-2003 hsu

Take advantage of pre-existing lock-free synchronization and type stable memory
to avoid acquiring SMP locks during expensive copyout process.


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 107113 20-Nov-2002 luigi

Back out some style changes. They are not urgent,
I will put them back in after 5.0 is out.

Requested by: sam
Approved by: re


# 107018 17-Nov-2002 luigi

Minor documentation changes and indentation fix.

Replace m_copy() with m_copypacket() where applicable.

While at it, fix some function headers and remove 'register' from
variable declarations.


# 105651 21-Oct-2002 iedowse

Implement a new IP_SENDSRCADDR ancillary message type that permits
a server process bound to a wildcard UDP socket to select the IP
address from which outgoing packets are sent on a per-datagram
basis. When combined with IP_RECVDSTADDR, such a server process can
guarantee to reply to an incoming request using the same source IP
address as the destination IP address of the request, without having
to open one socket per server IP address.

Discussed on: -net
Approved by: re


# 105649 21-Oct-2002 iedowse

Remove the "temporary connection" hack in udp_output(). In order
to send datagrams from an unconnected socket, we used to first block
input, then connect the socket to the sendmsg/sendto destination,
send the datagram, and finally disconnect the socket and unblock
input.

We now use in_pcbconnect_setup() to check if a connect() would have
succeeded, but we never record the connection in the PCB (local
anonymous port allocation is still recorded, though). The result
from in_pcbconnect_setup() authorises the sending of the datagram
and selects the local address and port to use, so we just construct
the header and call ip_output().

Discussed on: -net
Approved by: re


# 105201 16-Oct-2002 sam

correct PCB locking in broadcast/multicast case that was exposed by change
to use udp_append

Reviewed by: hsu


# 105199 16-Oct-2002 sam

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 105194 16-Oct-2002 sam

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# 101948 15-Aug-2002 rwatson

Code formatting sync to trustedbsd_mac: don't perform an assignment
in an if clause.

PR:
Submitted by:
Reviewed by:
Approved by:
Obtained from:
MFC after:


# 101934 15-Aug-2002 rwatson

Rename mac_check_socket_receive() to mac_check_socket_deliver() so that
we can use the names _receive() and _send() for the receive() and send()
checks. Rename related constants, policy implementations, etc.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101335 04-Aug-2002 luigi

bugfix: move check for udp_blackhole before the one for icmp_bandlim.

MFC after: 3 days


# 101185 01-Aug-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Add MAC support for the UDP protocol. Invoke appropriate MAC entry
points to label packets that are generated by local UDP sockets,
and to authorize delivery of mbufs to local sockets both in the
multicast/broadcast case and the unicast case.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100831 28-Jul-2002 truckman

Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.


# 99869 12-Jul-2002 truckman

Back out the previous change, since it looks like locking udbinfo provides
sufficient protection.


# 99863 12-Jul-2002 truckman

Lock inp while we're accessing it.


# 99838 11-Jul-2002 truckman

Defer calling SYSCTL_OUT() until after the locks have been released.


# 98598 21-Jun-2002 hsu

Fix logic which resulted in missing a call to INP_UNLOCK().

Submitted by: jlemon, mux


# 98211 14-Jun-2002 hsu

Notify functions can destroy the pcb, so they have to return an
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.

Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)


# 98147 12-Jun-2002 hsu

The UDP head was unlocked too early in one unicast case.

Submitted by: bug reported by arr


# 98102 10-Jun-2002 hsu

Lock up inpcb.

Submitted by: Jennifer Yang <yangjihui@yahoo.com>


# 97658 31-May-2002 tanimura

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 96972 20-May-2002 tanimura

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 95759 30-Apr-2002 tanimura

Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.

Requested by: bde

Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.

While I am here, sort include files alphabetically, where possible.


# 93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 93085 24-Mar-2002 bde

Fixed some style bugs in the removal of __P(()). Continuation lines
were not outdented to preserve non-KNF lining up of code with parentheses.
Switch to KNF formatting.


# 92976 22-Mar-2002 rwatson

Merge from TrustedBSD MAC branch:

Move the network code from using cr_cansee() to check whether a
socket is visible to a requesting credential to using a new
function, cr_canseesocket(), which accepts a subject credential
and object socket. Implement cr_canseesocket() so that it does a
prison check, a uid check, and add a comment where shortly a MAC
hook will go. This will allow MAC policies to seperately
instrument the visibility of sockets from the visibility of
processes.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 92760 20-Mar-2002 jeff

Switch vm_zone.h with uma.h. Change over to uma interfaces.


# 92723 19-Mar-2002 alfred

Remove __P.


# 91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 91354 27-Feb-2002 dd

Introduce a version field to `struct xucred' in place of one of the
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being). Accordingly, change users of
xucred to set and check this field as appropriate. In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former. This also has the pleasant sideaffect of removing some
duplicate code.

Reviewed by: rwatson


# 86183 08-Nov-2001 rwatson

o Replace reference to 'struct proc' with 'struct thread' in 'struct
sysctl_req', which describes in-progress sysctl requests. This permits
sysctl handlers to have access to the current thread, permitting work
on implementing td->td_ucred, migration of suser() to using struct
thread to derive the appropriate ucred, and allowing struct thread to be
passed down to other code, such as network code where td is not currently
available (and curproc is used).

o Note: netncp and netsmb are not updated to reflect this change, as they
are not currently KSE-adapted.

Reviewed by: julian
Obtained from: TrustedBSD Project


# 85315 22-Oct-2001 ume

restore the data of the ip header when extended udp header and data checksum
is calculated. this caused some trouble in the code which the ip header
is not modified. for example, inbound policy lookup failed.

Obtained from: KAME
MFC after: 1 week


# 84736 09-Oct-2001 rwatson

- Combine kern.ps_showallprocs and kern.ipc.showallsockets into
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().

Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.

Reviewed by: ps, billf
Obtained from: TrustedBSD Project


# 84527 05-Oct-2001 ps

Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks


# 83742 20-Sep-2001 rwatson

o Rename u_cansee() to cr_cansee(), making the name more comprehensible
in the face of a rename of ucred to cred, and possibly generally.

Obtained from: TrustedBSD Project


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 82884 03-Sep-2001 julian

Patches from Keiichi SHIMA <keiichi@iij.ad.jp>
to make ip use the standard protosw structure again.

Obtained from: Well, KAME I guess.


# 80406 26-Jul-2001 ume

move ipsec security policy allocation into in_pcballoc, before
making pcbs available to the outside world. otherwise, we will see
inpcb without ipsec security policy attached (-> panic() in ipsec.c).

Obtained from: KAME
MFC after: 3 days


# 78697 24-Jun-2001 dwmalone

Allow getcred sysctl to work in jailed root processes. Processes can
only do getcred calls for sockets which were created in the same jail.
This should allow the ident to work in a reasonable way within jails.

PR: 28107
Approved by: des, rwatson


# 78667 23-Jun-2001 ru

Add netstat(1) knob to reset net.inet.{ip|icmp|tcp|udp|igmp}.stats.
For example, ``netstat -s -p ip -z'' will show and reset IP stats.

PR: bin/17338


# 78064 11-Jun-2001 ume

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 74209 13-Mar-2001 ru

Count and show incoming UDP datagrams with no checksum.


# 73109 26-Feb-2001 jlemon

Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly.

For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.

Clean up some layering violations (access to tcp internals from in_pcb)


# 72922 22-Feb-2001 jesper

Redo the security update done in rev 1.54 of src/sys/netinet/tcp_subr.c
and 1.84 of src/sys/netinet/udp_usrreq.c

The changes broken down:

- remove 0 as a wildcard for addresses and port numbers in
src/sys/netinet/in_pcb.c:in_pcbnotify()
- add src/sys/netinet/in_pcb.c:in_pcbnotifyall() used to notify
all sessions with the specific remote address.
- change
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
to use in_pcbnotifyall() to notify multiple sessions, instead of
using in_pcbnotify() with 0 as src address and as port numbers.
- remove check for src port == 0 in
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
as they are no longer needed.
- move handling of redirects and host dead from in_pcbnotify() to
udp_ctlinput() and tcp_ctlinput(), so they will call
in_pcbnotifyall() to notify all sessions with the specific
remote address.

Approved by: jlemon
Inspired by: NetBSD


# 72786 21-Feb-2001 rwatson

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project


# 72778 20-Feb-2001 jesper

Only call in_pcbnotify if the src port number != 0, as we
treat 0 as a wildcard in src/sys/in_pbc.c:in_pcbnotify()

It's sufficient to check for src|local port, as we'll have no
sessions with src|local port == 0

Without this a attacker sending ICMP messages, where the attached
IP header (+ 8 bytes) has the address and port numbers == 0, would
have the ICMP message applied to all sessions.

PR: kern/25195
Submitted by: originally by jesper, reimplimented by jlemon's advice
Reviewed by: jlemon
Approved by: jlemon


# 72650 18-Feb-2001 green

Switch to using a struct xucred instead of a struct xucred when not
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by: bde


# 72357 11-Feb-2001 bmilekic

Clean up RST ratelimiting. Previously, ratelimiting occured before tests
were performed to determine if the received packet should be reset. This
created erroneous ratelimiting and false alarms in some cases. The code
has now been reorganized so that the checks for validity come before
the call to badport_bandlim. Additionally, a few changes in the symbolic
names of the bandlim types have been made, as well as a clarification of
exactly which type each RST case falls under.

Submitted by: Mike Silbersack <silby@silby.com>


# 71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 70330 24-Dec-2000 phk

Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and
to be configurable with respect to acting only in SYN or in all TCP states.

PR: 23665
Submitted by: Jesper Skriver <jesper@skriver.dk>


# 70070 15-Dec-2000 bmilekic

Change the following:

1. ICMP ECHO and TSTAMP replies are now rate limited.
2. RSTs generated due to packets sent to open and unopen ports
are now limited by seperate counters.
3. Each rate limiting queue now has its own description, as
follows:

Limiting icmp unreach response from 439 to 200 packets per second
Limiting closed port RST response from 283 to 200 packets per second
Limiting open port RST response from 18724 to 200 packets per second
Limiting icmp ping response from 211 to 200 packets per second
Limiting icmp tstamp response from 394 to 200 packets per second

Submitted by: Mike Silbersack <silby@silby.com>


# 68179 01-Nov-2000 ru

Wrong checksum may have been computed for certain UDP packets.

Reviewed by: jlemon


# 68056 31-Oct-2000 ru

Do not waste a time saving a copy of IP header if we are certainly
not going to send an ICMP error message (net.inet.udp.blackhole=1).


# 67708 27-Oct-2000 phk

Convert all users of fldoff() to offsetof(). fldoff() is bad
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.

Define __offsetof() in <machine/ansi.h>

Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>

Remove myriad of local offsetof() definitions.

Remove includes of <stddef.h> in kernel code.

NB: Kernelcode should *never* include from /usr/include !

Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.

Deprecate <struct.h> with a warning. The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.

Paritials reviews by: various.
Significant brucifications by: bde


# 65984 17-Sep-2000 phk

Reverse last commit, a better fix has been found.


# 65976 17-Sep-2000 phk

Make sure UDP sockets are explicitly bind(2)'ed [sic] before we connect(2)
them.

PR: 20946
Isolated by: Aaron Gifford <agifford@infowest.com>


# 65327 01-Sep-2000 ru

Fixed broken ICMP error generation, unified conversion of IP header
fields between host and network byte order. The details:

o icmp_error() now does not add IP header length. This fixes the problem
when icmp_error() is called from ip_forward(). In this case the ip_len
of the original IP datagram returned with ICMP error was wrong.

o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host
byte order, so DTRT and convert these fields back to network byte order
before sending a message. This fixes the problem described in PR 16240
and PR 20877 (ip_id field was returned in host byte order).

o ip_ttl decrement operation in ip_forward() was moved down to make sure
that it does not corrupt the copy of original IP datagram passed later
to icmp_error().

o A copy of original IP datagram in ip_forward() was made a read-write,
independent copy. This fixes the problem I first reported to Garrett
Wollman and Bill Fenner and later put in audit trail of PR 16240:
ip_output() (not always) converts fields of original datagram to network
byte order, but because copy (mcopy) and its original (m) most likely
share the same mbuf cluster, ip_output()'s manipulations on original
also corrupted the copy.

o ip_output() now expects all three fields, ip_len, ip_off and (what is
significant) ip_id in host byte order. It was a headache for years that
ip_id was handled differently. The only compatibility issue here is the
raw IP socket interface with IP_HDRINCL socket option set and a non-zero
ip_id field, but ip.4 manual page was unclear on whether in this case
ip_id field should be in host or network byte order.


# 65248 30-Aug-2000 ru

Backout the hack in rev 1.71, I am working on a better patch
that should cover almost all inconsistencies in ICMP error
generation.


# 65197 29-Aug-2000 darrenr

Apply appropriate patch.

PR: 20877
Submitted by: Frank Volf (volf@oasis.IAEhv.nl)


# 62587 04-Jul-2000 itojun

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 62573 04-Jul-2000 phk

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


# 62454 03-Jul-2000 phk

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


# 60872 24-May-2000 bde

Fixed some style bugs (mainly convoluted logic for blackhole processing).


# 60798 22-May-2000 dan

sysctl'ize ICMP_BANDLIM and ICMP_BANDLIM_SUPPRESS_OUTPUT.

Suggested by: des/nbm


# 58698 27-Mar-2000 jlemon

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


# 55009 22-Dec-1999 shin

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 54952 21-Dec-1999 eivind

Change incorrect NULLs to 0s


# 54799 19-Dec-1999 green

M_PREPEND-related cleanups (unregisterifying struct mbuf *s).


# 54526 13-Dec-1999 shin

Always set INP_IPV4 flag for IPv4 pcb entries, because netstat needs it
to print out protocol specific pcb info.

A patch submitted by guido@gvr.org, and asmodai@wxs.nl also reported
the problem.
Thanks and sorry for your troubles.

Submitted by: guido@gvr.org
Reviewed by: shin


# 54263 07-Dec-1999 shin

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translater daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 53295 17-Nov-1999 phk

The logic for blackhole processing does not free mbufs if the
blackhole flag is set.

PR: 14958
Submitted by: Larry Baird <lab@gta.com>
Reviewed by: phk


# 52904 05-Nov-1999 shin

KAME related header files additions and merges.
(only those which don't affect c source files so much)

Reviewed by: cvs-committers
Obtained from: KAME project


# 51381 19-Sep-1999 green

Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.


# 50477 28-Aug-1999 peter

$Id$ -> $FreeBSD$


# 50043 19-Aug-1999 csgr

Fix breakage if blackhole=1 and tiflags & TH_SYN, plus
style(9) fixes

Submitted by: Jonathon Lemon


# 49968 17-Aug-1999 csgr

Add net.inet.tcp.blackhole and net.inet.udp.blackhole
sysctl knobs.

With these knobs on, refused connection attempts are dropped
without sending a RST, or Port unreachable in the UDP case.
In the TCP case, sending of RST is inhibited iff the incoming
segment was a SYN.

Docs and rc.conf settings to follow.


# 48758 11-Jul-1999 green

Two new sysctls: net.inet.tcp.getcred and net.inet.udp.getcred. These take
a sockaddr_in[2] (local, then remote) and return a struct ucred. Example
code for these is at:
http://www.FreeBSD.org/~green/inetd_ident.patch
http://www.FreeBSD.org/~green/freebsd4.c (for pidentd)

Reviewed by: bde


# 48023 19-Jun-1999 green

This is the much-awaited cleaned up version of IPFW [ug]id support.
All relevant changes have been made (including ipfw.8).


# 46381 03-May-1999 billf

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 46155 28-Apr-1999 phk

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# 41487 03-Dec-1998 dillon

Reviewed by: freebsd-current

Add ICMP_BANDLIM option and 'net.inet.icmp.icmplim' sysctl. If option
is specified in kernel config, icmplim defaults to 100 pps. Setting it
to 0 will disable the feature. This feature limits ICMP error responses
for packets sent to bad tcp or udp ports, which does a lot to help the
machine handle network D.O.S. attacks.

The kernel will report packet rates that exceed the limit at a rate of
one kernel printf per second. There is one issue in regards to the
'tail end' of an attack... the kernel will not output the last report
until some unrelated and valid icmp error packet is return at some
point after the attack is over. This is a minor reporting issue only.


# 38513 24-Aug-1998 dfr

Re-implement tcp and ip fragment reassembly to not store pointers in the
ip header which can't work on alpha since pointers are too big.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


# 36079 15-May-1998 wollman

Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).


# 34923 28-Mar-1998 bde

Fixed style bugs (mostly) in previous commit.


# 34881 24-Mar-1998 wollman

Use the zone allocator to allocate inpcbs and tcpcbs. Each protocol creates
its own zone; this is used particularly by TCP which allocates both inpcb and
tcpcb in a single allocation. (Some hackery ensures that the tcpcb is
reasonably aligned.) Also keep track of the number of pcbs of each type
allocated, and keep a generation count (instance version number) for future
use.


# 32821 27-Jan-1998 dg

Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).


# 32773 25-Jan-1998 steve

Fix a couple of operator precedence bugs.

PR: 5450
Submitted by: Sakari Jalovaara <sja@tekla.fi>


# 31882 19-Dec-1997 bde

Don't use ANSI string concatenation to misformat a string.


# 30813 28-Oct-1997 bde

Removed unused #includes.


# 29366 14-Sep-1997 peter

Update network code to use poll support.


# 28270 16-Aug-1997 wollman

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 25201 27-Apr-1997 wollman

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 24570 03-Apr-1997 dg

Reorganize elements of the inpcb struct to take better advantage of
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).

Discussed-with: wollman


# 23324 03-Mar-1997 dg

Improved performance of hash algorithm while (hopefully) not reducing
the quality of the hash distribution. This does not fix a problem dealing
with poor distribution when using lots of IP aliases and listening
on the same port on every one of them...some other day perhaps; fixing
that requires significant code changes.
The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>


# 23082 24-Feb-1997 wollman

Fix #include order.


# 22900 18-Feb-1997 wollman

Convert raw IP from mondo-switch-statement-from-Hell to
pr_usrreqs. Collapse duplicates with udp_usrreq.c and
tcp_usrreq.c (calling the generic routines in uipc_socket2.c and
in_pcb.c). Calling sockaddr()_ or peeraddr() on a detached
socket now traps, rather than harmlessly returning an error; this
should never happen. Allow the raw IP buffer sizes to be
controlled via sysctl.


# 22719 14-Feb-1997 wollman

Fix the mechanism for choosing wehether to save the slow-start threshold
in the route. This allows us to remove the unconditional setting of the
pipesize in the route, which should mean that SO_SNDBUF and SO_RCVBUF
should actually work again. While we're at it:

- Convert udp_usrreq from `mondo switch statement from Hell' to new-style.
- Delete old TCP mondo switch statement from Hell, which had previously
been diked out.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 19622 11-Nov-1996 fenner

Add the IP_RECVIF socket option, which supplies a packet's incoming interface
using a sockaddr_dl.

Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR)
to work for multicast UDP and raw sockets as well. (They previously only
worked for unicast UDP).


# 19183 25-Oct-1996 fenner

Don't allow reassembly to create packets bigger than IP_MAXPACKET, and count
attempts to do so.
Don't allow users to source packets bigger than IP_MAXPACKET.
Make UDP length and ipovly's protocol length unsigned short.

Reviewed by: wollman
Submitted by: (partly by) kml@nas.nasa.gov (Kevin Lahey)


# 18795 07-Oct-1996 dg

Improved in_pcblookuphash() to support wildcarding, and changed relavent
callers of it to take advantage of this. This reduces new connection
request overhead in the face of a large number of PCBs in the system.
Thanks to David Filo <filo@yahoo.com> for suggesting this and providing
a sample implementation (which wasn't used, but showed that it could be
done).

Reviewed by: wollman


# 16206 08-Jun-1996 bde

Changed some memcpy()'s back to bcopy()'s.

gcc only inlines memcpy()'s whose count is constant and didn't inline
these. I want memcpy() in the kernel go away so that it's obvious that
it doesn't need to be optimized. Now it is only used for one struct
copy in si.c.


# 16143 05-Jun-1996 wollman

Instrument UDP PCB hashing to see how often the hash lookup is effective
for incoming packets.


# 15701 09-May-1996 wollman

Make it possible to return more than one piece of control information
(PR #1178).
Define a new SO_TIMESTAMP socket option for datagram sockets to return
packet-arrival timestamps as control information (PR #1179).

Submitted by: Louis Mamakos <loiue@TransSys.com>


# 15525 02-May-1996 fenner

Back out my stupid braino; I was thinking strlen and not sizeof.


# 15524 02-May-1996 fenner

Size temp var correctly; buf[4*sizeof "123"] is not long enough
to store "192.252.119.189\0".


# 15414 27-Apr-1996 ache

inet_ntoa buffer was evaluated twice in log_in_vain, fix it.
Thanx to: jdp


# 15154 09-Apr-1996 pst

Logging UDP and TCP connection attempts should not be enabled by default.
It's trivial to create a denial of service attack on a box so enabled.

These messages, if enabled at all, must be rate-limited. (!)


# 15038 04-Apr-1996 phk

Log TCP syn packets for ports we don't listen on.
Controlled by: sysctl net.inet.tcp.log_in_vain: 1

Log UDP syn packets for ports we don't listen on.
Controlled by: sysctl net.inet.udp.log_in_vain: 1

Suggested by: Warren Toomey <wkt@cs.adfa.oz.au>


# 14546 11-Mar-1996 dg

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# 12881 16-Dec-1995 bde

Uniformized pr_ctlinput protosw functions. The third arg is now `void
*' instead of caddr_t and it isn't optional (it never was). Most of the
netipx (and netns) pr_ctlinput functions abuse the second arg instead of
using the third arg but fixing this is beyond the scope of this round
of changes.


# 12820 14-Dec-1995 phk

Another mega commit to staticize things.


# 12657 06-Dec-1995 bde

Removed unnecessary #includes of vm stuff. Most of them were once
prerequisites for <sys/sysctl.h>.

subr_prof.c:
Also replaced #include of <sys/user.h> by #include of <sys/resourcevar.h>.


# 12325 16-Nov-1995 bde

Fixed recent staticizations. Some protypes for static functions were
left in headers and not staticized.


# 12296 14-Nov-1995 phk

New style sysctl & staticize alot of stuff.


# 10961 22-Sep-1995 wollman

Merge 4.4-Lite-2: always check the UDP checksum if it is present, even
if we are not generating checksums. (Save a test in the input path.)


# 10095 17-Aug-1995 olah

Add a sanity check for the UDP length field in order to prevent
malformed UDP packets to panic the kernel.
Reviewed by: davidg, wollman
Obtained from: dab@berserkly.cray.com (David A. Borman) via end2end list


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8384 09-May-1995 dg

Replaced some bcopy()'s with memcpy()'s so that gcc while inline/optimize.


# 8235 03-May-1995 dg

Changed in_pcblookuphash() to not automatically call in_pcblookup() if
the lookup fails. Updated callers to deal with this. Call in_pcblookuphash
instead of in_pcblookup() in in_pcbconnect; this improves performance of
UDP output by about 17% in the standard case.


# 7684 09-Apr-1995 dg

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 6482 16-Feb-1995 wollman

spl back down in unusual out-of-memory condition in udp_output().

Obtained from: Stevens, vol. 2, exercise 23.4 (solution p. 1083)


# 6479 16-Feb-1995 wollman

Don't add back in the IP header length to ip_len; icmp_error will do it
for us.

Obtained from: Stevens, vol. 2, p. 774


# 6472 16-Feb-1995 wollman

Add lots of useful MIB variables and a few not-so-useful ones for
completeness.


# 3311 02-Oct-1994 phk

GCC cleanup.
Reviewed by:
Submitted by:
Obtained from:


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources