History log of /freebsd-11-stable/sys/netinet/tcp_timewait.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 368183 30-Nov-2020 tuexen

MFC r367530:
RFC 7323 specifies that:
* TCP segments without timestamps should be dropped when support for
the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
for the timestamp option has not been negotiated.
This patch enforces the above.
Manually resolved merge conflicts.

MFC 367891:
Fix an issue I introuced in r367530: tcp_twcheck() can be called
with to == NULL for SYN segments. So don't assume tp != NULL.
Thanks to jhb@ for reporting and suggesting a fix.

MFC r367946:
Fix two occurences of a typo in a comment introduced in r367530.
Thanks to lstewart@ for reporting them.

PR: 250499
Reviewed by: gnn, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D27148


# 347161 05-May-2019 tuexen

MFC r336937:
Send consistent SEG.WIN when using timewait codepath for TCP.

When sending TCP segments from the timewait code path, a stored
value of the last sent window is used. Use the same code for
computing this in the timewait code path as in the main code
path used in tcp_output() to avoid inconsistencies.

MFC r344148:
Fix a byte ordering issue for the advertised receiver window in ACK
segments sent in TIMEWAIT state, which I introduced in r336937.


# 347157 05-May-2019 tuexen

MFC r336932:
Add missing send/recv dtrace probes for TCP.

These missing probe are mostly in the syncache and timewait code.

Reviewed by: markj@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16369


# 331722 29-Mar-2018 eadler

Revert r330897:

This was intended to be a non-functional change. It wasn't. The commit
message was thus wrong. In addition it broke arm, and merged crypto
related code.

Revert with prejudice.

This revert skips files touched in r316370 since that commit was since
MFCed. This revert also skips files that require $FreeBSD$ property
changes.

Thank you to those who helped me get out of this mess including but not
limited to gonzo, kevans, rgrimes.

Requested by: gjb (re)


# 330897 14-Mar-2018 eadler

Partial merge of the SPDX changes

These changes are incomplete but are making it difficult
to determine what other changes can/should be merged.

No objections from: pfg


# 324948 24-Oct-2017 jch

MFC r324179, r324193:

r324179:

Fix an infinite loop in tcp_tw_2msl_scan() when an INP_TIMEWAIT inp has
been destroyed before its tcptw with INVARIANTS undefined.

This is a symmetric change of r307551:

A INP_TIMEWAIT inp should not be destroyed before its tcptw, and INVARIANTS
will catch this case. If INVARIANTS is undefined it will emit a log(LOG_ERR)
and avoid a hard to debug infinite loop in tcp_tw_2msl_scan().

Reported by: Ben Rubson, hselasky
Submitted by: hselasky
Tested by: Ben Rubson, jch
Sponsored by: Verisign, inc
Differential Revision: https://reviews.freebsd.org/D12267

r324193:

Forgotten bits in r324179: Include sys/syslog.h if INVARIANTS is not defined


# 310212 18-Dec-2016 tuexen

MFC r308832:

Ensure that TCP state changes to state-closing are reported via dtrace.
This does not cover state changes from TIME-WAIT.

Sponsored by: Netflix, Inc.


# 307905 25-Oct-2016 jch

MFC r307551:

Fix a double-free when an inp transitions to INP_TIMEWAIT state
after having been dropped.

This change enforces in_pcbdrop() logic in tcp_input():

"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."

PR: 203175
Reported by: Palle Girgensohn, Slawa Olhovchenkov
Tested by: Slawa Olhovchenkov
Reviewed by: Slawa Olhovchenkov
Approved by: gnn, Slawa Olhovchenkov
Differential Revision: https://reviews.freebsd.org/D8211
Sponsored by: Verisign, inc


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
# 302098 22-Jun-2016 bz

No longer mark TCP TW zone NO_FREE.
Timewait code does a proper cleanup after itself.

Reviewed by: gnn
Approved by: re (gjb)
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6922


# 296881 14-Mar-2016 glebius

Redo r294869. The array of counters for TCP states doesn't belong to
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.

Place running connection counts into separate array, and provide
separate read-only sysctl oid for it.


# 294869 26-Jan-2016 glebius

Augment struct tcpstat with tcps_states[], which is used for book-keeping
the amount of TCP connections by state. Provides a cheap way to get
connection count without traversing the whole pcb list.

Sponsored by: Netflix


# 286227 03-Aug-2015 jch

Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

- The existing TCP INP_INFO lock continues to protect the global inpcb list
stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
and free) and inpcb global counters.

It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR: 183659
Differential Revision: https://reviews.freebsd.org/D2599
Reviewed by: jhb, adrian
Tested by: adrian, nitroboost-gmail.com
Sponsored by: Verisign, Inc.


# 282300 01-May-2015 gnn

Add a state transition call to show that we have entered TIME_WAIT.
Although this is not important to the rest of the TCP processing
it is a conveneint way to make the DTrace state-transition probe
catch this important state change.

MFC after: 1 week


# 274225 07-Nov-2014 glebius

Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.

Sponsored by: Nginx, Inc.


# 273850 30-Oct-2014 jch

Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and
tcp_tw_2msl_scan(). This race condition drives unplanned timewait
timeout cancellation. Also simplify implementation by holding inpcb
reference and removing tcptw reference counting.

Differential Revision: https://reviews.freebsd.org/D826
Submitted by: Marc De la Gueronniere <mdelagueronniere@verisign.com>
Submitted by: jch
Reviewed By: jhb (mentor), adrian, rwatson
Sponsored by: Verisign, Inc.
MFC after: 2 weeks
X-MFC-With: r264321


# 269526 04-Aug-2014 hiren

Add a comment for easier code understanding.


# 266907 30-May-2014 bz

While PAWS is disabled, there are no consumers for the tcp options
argument to tcp_twcheck(); thus mark it __unused.

MFC after: 2 weeks


# 266618 24-May-2014 bz

Make tcp_twrespond() file local private; this removes it from the
public KPI; it is not used anywhere else and seems it never was.

MFC after: 2 weeks


# 266205 15-May-2014 silby

Remove the function tcp_twrecycleable; it has been #if 0'd for
eight years. The original concept was to improve the
corner case where you run out of ephemeral ports, but it
was causing performance problems and the mechanism
of limiting the number of time_wait sockets serves
the same purpose in the end.

Reviewed by: bz


# 264356 11-Apr-2014 jhb

Some whitespace and style fixes.

Submitted by: bde


# 264351 11-Apr-2014 jhb

The tw_pcbrele() function does not need the global timewait lock.

Submitted by: Julien Charbon
Suggested by: glebius


# 264342 11-Apr-2014 jhb

Don't leak the TCP pcbinfo lock if a time wait connection is closed
in between grabbing a reference on the connection structure and obtaining
the pcbinfo lock.

Reviewed by: Julien Charbon


# 264321 10-Apr-2014 jhb

Currently, the TCP slow timer can starve TCP input processing while it
walks the list of connections in TIME_WAIT closing expired connections
due to contention on the global TCP pcbinfo lock.

To remediate, introduce a new global lock to protect the list of
connections in TIME_WAIT. Only acquire the TCP pcbinfo lock when
closing an expired connection. This limits the window of time when
TCP input processing is stopped to the amount of time needed to close
a single connection.

Submitted by: Julien Charbon <jcharbon@verisign.com>
Reviewed by: rwatson, rrs, adrian
MFC after: 2 months


# 257176 26-Oct-2013 glebius

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 243882 05-Dec-2012 glebius

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 242854 10-Nov-2012 rdivacky

Initialize hdrlen to 0 to avoid clang warning in NOINET case.


# 241913 22-Oct-2012 glebius

Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>


# 236170 28-May-2012 bz

It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958


# 235961 25-May-2012 bz

MFp4 bz_ipv6_fast:

Add code to handle pre-checked TCP checksums as indicated by mbuf
flags to save the entire computation for validation if not needed.

In the IPv6 TCP output path only compute the pseudo-header checksum,
set the checksum offset in the mbuf field along the appropriate flag
as done in IPv4.

In tcp_respond() just initialize the IPv6 payload length to 0 as
ip6_output() will properly set it.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# 231767 15-Feb-2012 bz

Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when
hz >> 1000 and thus getting outside the timestamp clock frequenceny of
1ms < x < 1s per tick as mandated by RFC1323, leading to connection
resets on idle connections.

Always use a granularity of 1ms using getmicrouptime() making all but
relevant callouts independent of hz.

Use getmicrouptime(), not getmicrotime() as the latter may make a jump
possibly breaking TCP nfsroot mounts having our timestamps move forward
for more than 24.8 days in a second without having been idle for that
long.

PR: kern/61404
Reviewed by: jhb, mav, rrs
Discussed with: silby, lstewart
Sponsored by: Sandvine Incorporated (originally in 2011)
MFC after: 6 weeks


# 229700 06-Jan-2012 jhb

Tweak the last fix to match what was actually tested.

Pointy hat to: jhb


# 229672 05-Jan-2012 pluknet

Fix a typo.

X-MFC-with: 229665


# 229665 05-Jan-2012 jhb

Remove the assertion from tcp_input() that rcv_nxt is always greater
than or equal to rcv_adv and fix tcp_twstart() to handle this case by
assuming the last window was zero rather than a negative value.

The code in tcp_input() already safely handled this case. It can happen
due to delayed ACKs along with a remote sender that sends data beyond
the window we previously advertised. If we have room in our socket buffer
for the extra data beyond the advertised window, we will accept it.
However, if the ACK for that segment is delayed, then we will not
effectively fixup rcv_adv to account for that extra data until the
next segment arrives and forces out an ACK. When that next segment
arrives, rcv_nxt will be beyond rcv_adv.

Tested by: pjd
MFC after: 1 week


# 221891 14-May-2011 jhb

Oops, fix order of sequence numbers in KASSERT()'s to catch negative
receive windows to match the labels in the panic message.

Submitted by: trociny


# 221346 02-May-2011 jhb

Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowing draining data from the receive buffer), then the single byte
of data in the window probe is accepted. However, this can cause rcv_nxt
to be greater than rcv_adv. This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked.

During the window while rcv_nxt is greather than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1. On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long. In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.

Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greather than rcv_adv.

Reviewed by: bz
MFC after: 1 month


# 221250 30-Apr-2011 bz

Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness. To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


# 218909 21-Feb-2011 brucec

Fix typos - remove duplicate "the".

PR: bin/154928
Submitted by: Eitan Adler <lists at eitanadler.com>
MFC after: 3 days


# 215701 22-Nov-2010 dim

After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.


# 215317 14-Nov-2010 dim

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.


# 207369 29-Apr-2010 bz

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 204838 07-Mar-2010 bz

Destroy TCP UMA zones (empty or not) upon network stack teardown
to not leak them, otherwise making UMA/vmstat unhappy with every stoped vnet.
We will still leak pages (especially for zones marked NOFREE).

Reshuffle cleanup order in tcp_destroy() to get rid of what we can
easily free first.

Sponsored by: ISPsystem
Reviewed by: rwatson
MFC after: 5 days


# 196410 20-Aug-2009 peter

Fix signed comparison bug when ticks goes negative after 24 days of
uptime. This causes the tcp time_wait state code to fail to expire
sockets in timewait state.

Approved by: re (kensmith)


# 196019 01-Aug-2009 rwatson

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 195727 16-Jul-2009 rwatson

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# 195699 14-Jul-2009 rwatson

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 193731 08-Jun-2009 zec

Introduce an infrastructure for dismantling vnet instances.

Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)


# 193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 191734 02-May-2009 zec

Unbreak options VIMAGE + nooptions INVARIANTS kernel builds.

Submitted by: julian
Approved by: julian (mentor)


# 191548 26-Apr-2009 zec

In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by: bz (an older version of the patch)
Approved by: julian (mentor)


# 190948 11-Apr-2009 rwatson

Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after: 3 days


# 190787 06-Apr-2009 zec

First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

At this stage, the new per-vnet initializer functions are
directly called from the existing global initialization code,
which should in most cases result in compiler inlining those
new functions, hence yielding a near-zero functional change.

Modify the existing initializer functions which are invoked via
protosw, like ip_init() et. al., to allow them to be invoked
multiple times, i.e. per each vnet. Global state, if any,
is initialized only if such functions are called within the
context of vnet0, which will be determined via the
IS_DEFAULT_VNET(curvnet) check (currently always true).

While here, V_irtualize a few remaining global UMA zones
used by net/netinet/netipsec networking code. While it is
not yet clear to me or anybody else whether this is the right
thing to do, at this stage this makes the code more readable,
and makes it easier to track uncollected UMA-zone-backed
objects on vnet removal. In the long run, it's quite possible
that some form of shared use of UMA zone pools among multiple
vnets should be considered.

Bump __FreeBSD_version due to changes in layout of structs
vnet_ipfw, vnet_inet and vnet_net.

Approved by: julian (mentor)


# 189848 15-Mar-2009 rwatson

Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted. Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

INP_TIMEWAIT
INP_SOCKREF
INP_ONESBCAST
INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after: 1 week (or after dependencies are MFC'd)
Reviewed by: bz


# 189196 28-Feb-2009 rwatson

Remove unreachable code for generating RST segments from tcp_twcheck();
this code became stale when T/TCP support was removed.

Discussed with: bz, sam
MFC after: 1 month


# 186222 17-Dec-2008 bz

Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by: rwatson during review [1]
Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks


# 185571 02-Dec-2008 bz

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 185370 27-Nov-2008 bz

Merge in6_pcbfree() into in_pcbfree() which after the previous
IPsec change in r185366 only differed in two additonal IPv6 lines.
Rather than splattering conditional code everywhere add the v6
check centrally at this single place.

Reviewed by: rwatson (as part of a larger changset)
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.


# 185348 26-Nov-2008 zec

Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 185088 19-Nov-2008 zec

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 183550 02-Oct-2008 zec

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 181803 17-Aug-2008 bz

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 178285 17-Apr-2008 rwatson

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# 172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 172467 07-Oct-2007 silby

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# 170289 04-Jun-2007 dwmalone

Despite several examples in the kernel, the third argument of
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.

Remove the instances where a sizeof(variable) is passed to stop
people accidently cut and pasting these examples.

In a few places this was sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported. In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.


# 169635 16-May-2007 oleg

Unbreak IPv4 kernel build.


# 169608 16-May-2007 andre

Move TIME_WAIT related functions and timer handling from files
other than repo copied tcp_subr.c into tcp_timewait.c#1.284:

tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck()

tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset()
tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop()
tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan()

This is a mechanical move with appropriate renames and making
them static if used only locally.

The tcp_tw_2msl_scan() cleanup function is still run from the
tcp_slowtimo() in tcp_timer.c.


# 169541 13-May-2007 andre

Complete the (mechanical) move of the TCP reassembly and timewait
functions from their origininal place to their own files.

TCP Reassembly from tcp_input.c -> tcp_reass.c
TCP Timewait from tcp_subr.c -> tcp_timewait.c


# 169482 11-May-2007 andre

Drop everything that doesn't belong into this new file.
It's neither functional not connected to the build yet.


# 169478 11-May-2007 andre

Forced commit to note repo copy from sys/netinet/tcp_subr.c rev. 1.280.


# 169477 11-May-2007 andre

Add the timestamp offset to struct tcptw so we can generate proper
ACKs in TIME_WAIT state that don't get dropped by the PAWS check
on the receiver.


# 169454 10-May-2007 rwatson

Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.


# 169347 07-May-2007 rwatson

When setting up timewait state for a TCP connection, don't hold the
socket lock over a crhold() of so_cred: so_cred is constant after
socket creation, so doesn't require locking to read.


# 169317 06-May-2007 andre

Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool. Change all users accordingly.


# 169154 30-Apr-2007 rwatson

Rename some fields of struct inpcbinfo to have the ipi_ prefix,
consistent with the naming of other structure field members, and
reducing improper grep matches. Clean up and comment structure
fields in structure definition.


# 168845 18-Apr-2007 andre

Make tcp_twrespond() use tcp_addoptions() instead of a home grown version.


# 168615 11-Apr-2007 andre

Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by: rwatson (earlier version)


# 168364 04-Apr-2007 andre

Retire unused TCP_SACK_DEBUG.


# 167785 21-Mar-2007 andre

ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.


# 167772 21-Mar-2007 andre

Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code. Should
the problem surface in the future we can simply resurrect it from cvs
history.


# 167721 19-Mar-2007 andre

Match up SYSCTL declaration style.


# 167636 16-Mar-2007 rwatson

Remove unused and #if 0'd net.inet.tcp.tcp_rttdflt sysctl.


# 167036 26-Feb-2007 mohans

Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.


# 165657 30-Dec-2006 jhb

Whitespace fix and remove an extra cast.


# 164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 162768 29-Sep-2006 maxim

o Convert w/spaces to tabs in the previous commit.


# 162767 29-Sep-2006 silby

Rather than autoscaling the number of TIME_WAIT sockets to maxsockets / 5,
scale it to min(ephemeral port range / 2, maxsockets / 5) so that people
with large gobs of memory and/or large maxsockets settings will not
exhaust their entire ephemeral port range with sockets in the TIME_WAIT
state during periods of heavy load.

Those who wish to tweak the size of the TIME_WAIT zone can still do so with
net.inet.tcp.maxtcptw.

Reviewed by: glebius, ru


# 162151 08-Sep-2006 glebius

Add a sysctl net.inet.tcp.nolocaltimewait that allows to suppress
creating a compress TIME WAIT states, if both connection endpoints
are local. Default is off.


# 162111 07-Sep-2006 ru

Back when we had T/TCP support, we used to apply different
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).

Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).


# 162084 06-Sep-2006 andre

First step of TSO (TCP segmentation offload) support in our network stack.

o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005


# 162064 06-Sep-2006 glebius

o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely
bad under high load. For example with 40k sockets and 25k tcptw
entries, connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions times and purges thousands of
tcptw entries at a time.
Besides practical unusability this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No stale entries purging shouldn't be done here. Second,
it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision tells nothing about the reason cycle was removed. Now
we need this cycle, since major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.

Reviewed by: ru


# 162035 05-Sep-2006 glebius

Finally fix rev. 1.256

Pointy hat to: glebius


# 162033 05-Sep-2006 glebius

Remove extra parenthesis in last commit.

Nitpicked by: ru


# 162031 05-Sep-2006 glebius

- Make net.inet.tcp.maxtcptw modifiable at run time.
- If net.inet.tcp.maxtcptw was ever set explicitly, do
not change it if kern.ipc.maxsockets is changed.


# 161645 26-Aug-2006 mohans

Fix for a bug that causes the computation of "len" in tcp_output() to
get messed up, resulting in an inconsistency between the TCP state
and so_snd.


# 161226 11-Aug-2006 mohans

Fixes an edge case bug in timewait handling where ticks rolling over causing
the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry).
Reviewed by: silby


# 160925 02-Aug-2006 rwatson

Move soisdisconnected() in tcp_discardcb() to one of its calling contexts,
tcp_twstart(), but not to the other, tcp_detach(), as the socket is
already being torn down and therefore there are no listeners. This avoids
a panic if kqueue state is registered on the socket at close(), and
eliminates to XXX comments. There is one case remaining in which
tcp_discardcb() reaches up to the socket layer as part of the TCP host
cache, which would be good to avoid.

Reported by: Goran Gajic <ggajic at afrodita dot rcub dot bg dot ac dot yu>


# 160549 21-Jul-2006 rwatson

Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket. pru_abort is now a
notification of close also, and no longer detaches. pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket. This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree(). With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by: gnn


# 160491 18-Jul-2006 ups

Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by: rwatson@, mohans@
MFC after: 1 week


# 158009 25-Apr-2006 rwatson

Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP,
into in_pcbdrop(). Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.

MFC after: 3 months


# 157977 23-Apr-2006 rwatson

Replace isn_mtx direct use with ISN_*() lock macros so that locking
details/strategy can be changed without touching every use.

MFC after: 3 months


# 157967 22-Apr-2006 rwatson

Introduce a new TCP mutex, isn_mtx, which protects the initial sequence
number state, rather than re-using pcbinfo. This introduces some
additional mutex operations during isn query, but avoids hitting the TCP
pcbinfo lock out of yet another frequently firing TCP timer.

MFC after: 3 months


# 157927 21-Apr-2006 ps

Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.


# 157478 04-Apr-2006 glebius

Add a tunable net.inet.tcp.maxtcptw, that allows to set a limit
on tcptw zone independently from setting a limit on socket zone.


# 157474 04-Apr-2006 rwatson

Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being
NULL. We currently do allow this to happen, but may want to remove that
possibility in the future. This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled. This will
be cleaned up in the future.

Found by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


# 157433 03-Apr-2006 rwatson

In TCP notify routines, check inpcb for INP_TIMEWAIT and INP_DROPPED.
The INP_DROPPED check replaces the current NULL checks; the INP_TIMEWAIT
checks appear to have always been required, but not been there, which
is/was a bug. This avoids unconditionally casting of in_ppcb to a tcpcb,
when it may be a twtcb, which may have resulted in obscure ICMP-related
panics in earlier releases.

MFC after: 3 months


# 157432 03-Apr-2006 rwatson

Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed. Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.

MFC after: 3 months


# 157431 03-Apr-2006 rwatson

Style tweaks: convert to ANSI from K&R function prototypes.

MFC after: 3 months


# 157430 03-Apr-2006 rwatson

Update comment on tcp_close() for new world order.

MFC after: 3 months


# 157427 03-Apr-2006 rwatson

Fix up locking surrounding tcp_drop sysctl: in the new world order, we
don't free inpcbs until after the socket is closed, so we always need
to unlock an inpcb after calling tcp_drop() on it.

MFC after: 3 months


# 157386 01-Apr-2006 rwatson

Properly handle an edge case previously not handled correctly: a
socket can have a tcp connection that has entered time wait
attached to it, in the event that shutdown() is called on the
socket and the FINs properly exchange before close(). In this
case we don't detach or free the inpcb, just leave the tcptw
detached and freed, but we must release the inpcb lock (which we
didn't previously).

MFC after: 3 months


# 157376 01-Apr-2006 rwatson

Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.

- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.

- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after: 3 months


# 155767 16-Feb-2006 andre

Have TCP Inflight disable itself if the RTT is below a certain
threshold. Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.

The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage. It defaults
to 10ms.

Tested by: Joao Barros <joao.barros-at-gmail.com>,
Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by: TCP/IP Optimization Fundraise 2005


# 151967 02-Nov-2005 andre

Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 151254 12-Oct-2005 philip

Unbreak the net.inet6.tcp6.getcred sysctl.

This makes inetd/auth work again in IPv6 setups.

Pointy hat to: ume/KAME


# 150804 02-Oct-2005 maxim

o Teach sysctl_drop() how to deal with the sockets in TIME_WAIT state.
This is a special case because tcp_twstart() destroys a tcp control
block via tcp_discardcb() so we cannot call tcp_drop(struct *tcpcb) on
such connections. Use tcp_twclose() instead.

MFC after: 5 days


# 149929 10-Sep-2005 andre

In tcp_ctlinput() do not swap ip->ip_len a second time. It
has been done in icmp_input() already.

This fixes the ICMP_UNREACH_NEEDFRAG case where no MTU was
proposed in the ICMP reply.

PR: kern/81813
Submitted by: Vitezslav Novy <vita at fio.cz>
MFC after: 3 days


# 149635 30-Aug-2005 andre

Use the correct mbuf type for MGET().


# 148616 01-Aug-2005 ume

recover the line which was wrongly disappeared during scope cleanup.
tcpdrop(8) should work for IPv6, again.


# 148385 25-Jul-2005 ume

scope cleanup. with this change
- most of the kernel code will not care about the actual encoding of
scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
scoped addresses as a special case.
- scope boundary check will be stricter. For example, the current
*BSD code allows a packet with src=::1 and dst=(some global IPv6
address) to be sent outside of the node, if the application do:
s = socket(AF_INET6);
bind(s, "::1");
sendto(s, some_global_IPv6_addr);
This is clearly wrong, since ::1 is only meaningful within a single
node, but the current implementation of the *BSD kernel cannot
reject this attempt.

Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from: KAME


# 148156 19-Jul-2005 rwatson

Remove no-op spl's and most comment references to spls, as TCP locking
is believed to be basically done (modulo any remaining bugs).

MFC after: 3 days


# 147735 01-Jul-2005 ps

Fix for a bug in the change that defers sack option processing until
after PAWS checks. The symptom of this is an inconsistency in the cached
sack state, caused by the fact that the sack scoreboard was not being
updated for an ACK handled in the header prediction path.

Found by: Andrey Chernov.
Submitted by: Noritoshi Demizu, Raja Mukerji.
Approved by: re


# 146864 01-Jun-2005 rwatson

Assert tcbinfo lock in tcp_drop() due to its call of tcp_close()
Assert tcbinfo lock in tcp_close() due to its call to in{,6}_detach()
Assert tcbinfo lock in tcp_drop_syn_sent() due to its call to tcp_drop()

MFC after: 7 days


# 145978 06-May-2005 cperciva

Fix two issues which were missed in FreeBSD-SA-05:08.kmem.

Reported by: Uwe Doering


# 145869 04-May-2005 andre

If we don't get a suggested MTU during path MTU discovery
look up the packet size of the packet that generated the
response, step down the MTU by one step through ip_next_mtu()
and try again.

Suggested by: dwmalone


# 145370 21-Apr-2005 ps

- Make the sack scoreboard logic use the TAILQ macros. This improves
code readability and facilitates some anticipated optimizations in
tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.

Submitted by: Raja Mukerji.
Reviewed by: Mohan Srinivasan, Noritoshi Demizu.


# 145360 21-Apr-2005 andre

Move Path MTU discovery ICMP processing from icmp_input() to
tcp_ctlinput() and subject it to active tcpcb and sequence
number checking. Previously any ICMP unreachable/needfrag
message would cause an update to the TCP hostcache. Now only
ICMP PMTU messages belonging to an active TCP session with
the correct src/dst/port and sequence number will update the
hostcache and complete the path MTU discovery process.

Note that we don't entirely implement the recommended counter
measures of Section 7.2 of the paper. However we close down
the possible degradation vector from trivially easy to really
complex and resource intensive. In addition we have limited
the smallest acceptable MTU with net.inet.tcp.minmss sysctl
for some time already, further reducing the effect of any
degradation due to an attack.

Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.2
MFC after: 3 days


# 145355 21-Apr-2005 andre

Ignore ICMP Source Quench messages for TCP sessions. Source Quench is
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.

Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.

Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after: 3 days


# 144857 10-Apr-2005 ps

- If the reassembly queue limit was reached or if we couldn't allocate
a reassembly queue state structure, don't update (receiver) sack
report.
- Similarly, if tcp_drain() is called, freeing up all items on the
reassembly queue, clean the sack report.

Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).


# 142906 01-Mar-2005 glebius

Use NET_CALLOUT_MPSAFE macro.


# 141886 14-Feb-2005 maxim

o Add handling of an IPv4-mapped IPv6 address.
o Use SYSCTL_IN() macro instead of direct call of copyin(9).

Submitted by: ume

o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no needs in
a separate struct tcp_ident_mapping.

Suggested by: ume


# 141282 04-Feb-2005 ume

teach scope of IPv6 address to net.inet6.tcp6.getcred.

MFC after: 1 week


# 141078 30-Jan-2005 rwatson

Update an additional reference to the rate of ISN tick callouts that was
missed in tcp_subr.c:1.216: projected_offset must also reflect how often
the tcp_isn_tick() callout will fire.

MFC after: 2 weeks
Submitted by: silby


# 141072 30-Jan-2005 rwatson

Have tcp_isn_tick() fire 100 times a second, rather than HZ times a
second; since the default hz has changed to 1000 times a second,
this resulted in unecessary work being performed.

MFC after: 2 weeks
Discussed with: phk, cperciva
General head nod: silby


# 139823 06-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 139222 22-Dec-2004 rwatson

Attempt to consistently use () around return values in calls to
return() in newer code (sysctl, ISN, timewait).


# 139221 22-Dec-2004 rwatson

Remove an XXXRW comment relating to whether or not the TCP timers are
MPSAFE: they are now believed to be.

Correct a typo in a second comment.

MFC after: 2 weeks


# 138410 05-Dec-2004 rwatson

Assert inpcb lock in:

tcpip_fillheaders()
tcp_discardcb()
tcp_close()
tcp_notify()
tcp_new_isn()
tcp_xmit_bandwidth_limit()

Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and
is asserted).

MFC after: 2 weeks


# 138025 23-Nov-2004 rwatson

tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it. Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after: 2 weeks


# 138020 23-Nov-2004 rwatson

Assert the inpcb lock in tcp_twstart(), which does both read-modify-write
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().

Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller. This is a bug.

Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.

Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.

MFC after: 2 weeks


# 138019 23-Nov-2004 rwatson

Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(),
and tcp_drop(), due to read-modify-write of TCP state variables.

MFC after: 2 weeks


# 138018 23-Nov-2004 rwatson

Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock
protects access to the ISN state variables.

Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.

Staticize internal ISN variables since they're not used outside of
tcp_subr.c.

MFC after: 2 weeks


# 137396 08-Nov-2004 suz

support TCP-MD5(IPv4) in KAME-IPSEC, too.

MFC after: 3 week


# 137139 02-Nov-2004 andre

Remove RFC1644 T/TCP support from the TCP side of the network stack.

A complete rationale and discussion is given in this message
and the resulting discussion:

http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on: -arch


# 136682 18-Oct-2004 rwatson

Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>


# 136151 05-Oct-2004 ps

- Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.

Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>


# 134793 05-Sep-2004 jmg

fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...


# 134041 19-Aug-2004 andre

For IPv6 access pointer to tcpcb only after we have checked it is valid.

Found by: Coverity's automated analysis (via Ted Unangst)


# 133874 16-Aug-2004 rwatson

White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>


# 133591 12-Aug-2004 dwmalone

In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreach
so that the locks held are the same as the IPv4 case.

Reviewed by: rwatson


# 133517 11-Aug-2004 andre

Backout removal of UMA_ZONE_NOFREE flag for all zones which are established
for structures with timers in them. It might be that a timer might fire
even when the associated structure has already been free'd. Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.

Discussed with: bosko, tegge, rwatson


# 133509 11-Aug-2004 andre

Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and
TCP code. This flag would have prevented giving back excessive free slabs
to the global pool after a transient peak usage.


# 133192 06-Aug-2004 rwatson

Pass pcbinfo structures to in6_pcbnotify() rather than pcbhead
structures, allowing in6_pcbnotify() to lock the pcbinfo and each
inpcb that it notifies of ICMPv6 events. This prevents inpcb
assertions from firing when IPv6 generates and delievers event
notifications for inpcbs.

Reported by: kuriyama
Tested by: kuriyama


# 133072 03-Aug-2004 andre

o Move the inflight sysctls to their own sub-tree under net.inet.tcp to be
more consistent with the other sysctls around it.


# 132653 26-Jul-2004 cperciva

Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb


# 132418 19-Jul-2004 jayanth

Let IN_FASTREOCOVERY macro decide if we are in recovery mode.

Nuke sackhole_limit for now. We need to add it back to limit the total
number of sack blocks in the system.


# 130993 23-Jun-2004 ps

Move the sack sysctl's under net.inet.tcp.sack

net.inet.tcp.do_sack -> net.inet.tcp.sack.enable
net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit

Requested by: wollman


# 130989 23-Jun-2004 ps

Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)


# 130821 20-Jun-2004 rwatson

If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE.


# 130387 12-Jun-2004 rwatson

Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# 128905 04-May-2004 rwatson

Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed. In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research


# 128452 20-Apr-2004 silby

Enhance our RFC1948 implementation to perform better in some pathlogical
TIME_WAIT recycling cases I was able to generate with http testing tools.

In short, as the old algorithm relied on ticks to create the time offset
component of an ISN, two connections with the exact same host, port pair
that were generated between timer ticks would have the exact same sequence
number. As a result, the second connection would fail to pass the TIME_WAIT
check on the server side, and the SYN would never be acknowledged.

I've "fixed" this by adding random positive increments to the time component
between clock ticks so that ISNs will *always* be increasing, no matter how
quickly the port is recycled.

Except in such contrived benchmarking situations, this problem should never
come up in normal usage... until networks get faster.

No MFC planned, 4.x is missing other optimizations that are needed to even
create the situation in which such quick port recycling will occur.


# 128019 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 127871 04-Apr-2004 rwatson

Two missed in previous commit -- compare pointer with NULL rather than
using it as a boolean.


# 127870 04-Apr-2004 rwatson

Prefer NULL to 0 when checking pointer values as integers or booleans.


# 126351 28-Feb-2004 rwatson

Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These
were needed by the MAC Framework until inpcbs gained labels.

Submitted by: sam


# 126253 25-Feb-2004 truckman

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# 126193 24-Feb-2004 andre

Convert the tcp segment reassembly queue to UMA and limit the maximum
amount of segments it will hold.

The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:

net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).

net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).

net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.

net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.

Tested by: bms, silby
Reviewed by: bms, silby


# 126002 19-Feb-2004 pjd

Fixed ucred structure leak.

Approved by: scottl (mentor)
PR: 54163
MFC after: 3 days


# 125819 14-Feb-2004 bms

Final brucification pass. Spell types consistently (u_int). Remove bogus
casts. Remove unnecessary parenthesis.

Submitted by: bde


# 125783 13-Feb-2004 bms

Brucification.

Submitted by: bde


# 125776 13-Feb-2004 ume

supported IPV6_RECVPATHMTU socket option.

Obtained from: KAME


# 125742 12-Feb-2004 bms

Update the prototype for tcpsignature_apply() to reflect the spelling of
the types used by m_apply()'s callback function, f, as documented in mbuf(9).

Noticed by: njl


# 125741 12-Feb-2004 bms

style(9) pass; whitespace and comments.

Submitted by: njl


# 125680 11-Feb-2004 bms

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


# 124258 08-Jan-2004 andre

Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.

For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limites of network components along the path.

This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).

We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.

o the local host is reveiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may chose to do what is described in the first attack and
send the data in packets with an TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.

For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).

This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.

We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is resetted and dropped.

MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day


# 124248 08-Jan-2004 andre

If path mtu discovery is enabled set the DF bit in all cases we
send packets on a tcp connection.

PR: kern/60889
Tested by: Richard Wendland <richard@wendland.org.uk>
Approved by: re (scottl)


# 124199 06-Jan-2004 andre

Enable the following TCP options by default to give it more exposure:

rfc3042 Limited retransmit
rfc3390 Increasing TCP's initial congestion Window
inflight TCP inflight bandwidth limiting

All my production server have it enabled and there have been no
issues. I am confident about having them on by default and it gives
us better overall TCP performance.

Reviewed by: sam (mentor)


# 123608 17-Dec-2003 jhb

Fix some becuase -> because typos.

Reported by: Marco Wertejuk <wertejuk@mwcis.com>


# 123607 17-Dec-2003 rwatson

Switch TCP over to using the inpcb label when responding in timed
wait, rather than the socket label. This avoids reaching up to
the socket layer during connection close, which requires locking
changes. To do this, introduce MAC Framework entry point
mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond()
instead of calling mac_create_mbuf_from_socket() or
mac_create_mbuf_netlayer(). Introduce MAC Policy entry point
mpo_create_mbuf_from_inpcb(), and implementations for various
policies, which generally just copy label data from the inpcb to
the mbuf. Assert the inpcb lock in the entry point since we
require consistency for the inpcb label reference.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 122996 26-Nov-2003 andre

Make sure all uses of stack allocated struct route's are properly
zeroed. Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.

Reviewed by: sam (mentor)
Approved by: re (scottl)


# 122922 20-Nov-2003 andre

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 122327 08-Nov-2003 sam

o correct locking problem: the inpcb must be held across tcp_respond
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond

Supported by: FreeBSD Foundation


# 121884 02-Nov-2003 silby

Add an additional check to the tcp_twrecycleable function; I had
previously only considered the send sequence space. Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.

The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.


# 121850 01-Nov-2003 silby

- Add a new function tcp_twrecycleable, which tells us if the ISN which
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.

- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.

This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.


# 121453 24-Oct-2003 silby

Reduce the number of tcp time_wait structs to maxsockets / 5; this ensures
that at most 20% of sockets can be in time_wait at one time, ensuring
that time_wait sockets do not starve real connections from inpcb
structures.

No implementation change is needed, jlemon already implemented a nice
LRU-ish algorithm for tcp_tw structure recycling.

This should reduce the need for sysadmins to lower the default msl on
busy servers.


# 121307 21-Oct-2003 silby

Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.


# 119995 11-Sep-2003 ru

Fix a bunch of off-by-one errors in the range checking code.


# 119245 21-Aug-2003 rwatson

Introduce two new MAC Framework and MAC policy entry points:

mac_reflect_mbuf_icmp()
mac_reflect_mbuf_tcp()

These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket. For example, in respond to a ping or a RST
packet to a SYN on a closed port.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 114794 07-May-2003 rwatson

Correct a bug introduced with reduced TCP state handling; make
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.

Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.

Approved by: re (scottl)
Reviewed by: hsu
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 113345 10-Apr-2003 rwatson

Remove a potential panic condition introduced by reduced TCP wait
state. Those changed attempted to work around the changed invariant
that inp->in_socket was sometimes now NULL, but the logic wasn't
quite right, meaning that inp->in_socket would be dereferenced by
cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC
were in use. Attempt to clarify and correct the logic.

Note: the work-around originally introduced with the reduced TCP
wait state handling to use cr_cansee() instead of cr_canseesocket()
in this case isn't really right, although it "Does the right thing"
for most of the cases in the base system. We'll need to address
this at some point in the future.

Pointed out by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 112009 08-Mar-2003 jlemon

Remove a panic(); if the zone allocator can't provide more timewait
structures, reuse the oldest one. Also move the expiry timer from
a per-structure callout to the tcp slow timer.

Sponsored by: DARPA, NAI Labs


# 111748 02-Mar-2003 des

More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).


# 111483 25-Feb-2003 rwatson

When generating a TCP response to a connection, not only test if the
tcpcb is NULL, but also its connected inpcb, since we now allow
elements of a TCP connection to hang around after other state, such
as the socket, has been recycled.

Tested by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 111231 21-Feb-2003 phk

- m = m_gethdr(M_NOWAIT, MT_HEADER);
+ m = m_gethdr(M_DONTWAIT, MT_HEADER);

'nuff said.


# 111153 19-Feb-2003 jlemon

Unbreak non-IPV6 compilation.

Caught by: phk
Sponsored by: DARPA, NAI Labs


# 111145 19-Feb-2003 jlemon

Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs


# 111144 19-Feb-2003 jlemon

Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the
routine does not require a tcpcb to operate. Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.

Sponsored by: DARPA, NAI Labs


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 110896 15-Feb-2003 hsu

Take advantage of pre-existing lock-free synchronization and type stable memory
to avoid acquiring SMP locks during expensive copyout process.


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 108265 24-Dec-2002 hsu

Validate inp to prevent an use after free.


# 107881 14-Dec-2002 dillon

Change tcp.inflight_min from 1024 to a production default of 6144. Create
a sysctl for the stabilization value for the bandwidth delay product (inflight)
algorithm and document it.

MFC after: 3 days


# 105586 20-Oct-2002 phk

Fix two instances of variant struct definitions in sys/netinet:

Remove the never completed _IP_VHL version, it has not caught on
anywhere and it would make us incompatible with other BSD netstacks
to retain this version.

Add a CTASSERT protecting sizeof(struct ip) == 20.

Don't let the size of struct ipq depend on the IPDIVERT option.

This is a functional no-op commit.

Approved by: re


# 105199 16-Oct-2002 sam

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 105194 15-Oct-2002 sam

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# 104825 10-Oct-2002 dillon

turn off debugging by default if bandwidth delay product limiting is
turned on (it is already off in -stable).


# 102368 24-Aug-2002 dillon

Correct bug in t_bw_rtttime rollover, #undef USERTT


# 102017 17-Aug-2002 dillon

Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit

MFC after: 1 week


# 101137 01-Aug-2002 rwatson

Document the undocumented assumption that at least one of the PCB
pointer and incoming mbuf pointer will be non-NULL in tcp_respond().
This is relied on by the MAC code for correctness, as well as
existing code.

Obtained from: TrustedBSD PRoject
Sponsored by: DARPA, NAI Labs


# 101106 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket. Assign labels
to newly accepted connections when the syncache/cookie code has done
its business. Also set peer labels as convenient. Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation. Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100831 28-Jul-2002 truckman

Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.


# 100335 18-Jul-2002 dillon

Introduce two new sysctl's:

net.inet.tcp.rexmit_min (default 3 ticks equiv)

This sysctl is the retransmit timer RTO minimum,
specified in milliseconds. This value is
designed for algorithmic stability only.

net.inet.tcp.rexmit_slop (default 200ms)

This sysctl is the retransmit timer RTO slop
which is added to every retransmit timeout and
is designed to handle protocol stack overheads
and delayed ack issues.

Note that the *original* code applied a 1-second
RTO minimum but never applied real slop to the RTO
calculation, so any RTO calculation over one second
would have no slop and thus not account for
protocol stack overheads (TCP timestamps are not
a measure of protocol turnaround!). Essentially,
the original code made the RTO calculation almost
completely irrelevant.

Please note that the 200ms slop is debateable.
This commit is not meant to be a line in the sand,
and if the community winds up deciding that increasing
it is the correct solution then it's easy to do.
Note that larger values will destroy performance
on lossy networks while smaller values may result in
a greater number of unnecessary retransmits.


# 99838 11-Jul-2002 truckman

Defer calling SYSCTL_OUT() until after the locks have been released.


# 99837 11-Jul-2002 truckman

Reduce the nesting level of a code block that doesn't need to be in
an else clause.


# 99156 30-Jun-2002 jesper

Extend the effect of the sysctl net.inet.tcp.icmp_may_rst
so that, if we recieve a ICMP "time to live exceeded in transit",
(type 11, code 0) for a TCP connection on SYN-SENT state, close
the connection.

MFC after: 2 weeks


# 98596 21-Jun-2002 hsu

TCP notify functions can change the pcb list.


# 98211 14-Jun-2002 hsu

Notify functions can destroy the pcb, so they have to return an
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.

Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)


# 98135 12-Jun-2002 hsu

Fix logic which resulted in missing a call to INP_UNLOCK().


# 98102 10-Jun-2002 hsu

Lock up inpcb.

Submitted by: Jennifer Yang <yangjihui@yahoo.com>


# 97658 31-May-2002 tanimura

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 96972 20-May-2002 tanimura

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 94390 10-Apr-2002 silby

Remove some ISN generation code which has been unused since the
syncache went in.

MFC after: 3 days


# 93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 92976 22-Mar-2002 rwatson

Merge from TrustedBSD MAC branch:

Move the network code from using cr_cansee() to check whether a
socket is visible to a requesting credential to using a new
function, cr_canseesocket(), which accepts a subject credential
and object socket. Implement cr_canseesocket() so that it does a
prison check, a uid check, and add a comment where shortly a MAC
hook will go. This will allow MAC policies to seperately
instrument the visibility of sockets from the visibility of
processes.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 92760 20-Mar-2002 jeff

Switch vm_zone.h with uma.h. Change over to uma interfaces.


# 92723 19-Mar-2002 alfred

Remove __P.


# 91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 91357 27-Feb-2002 alfred

More IPV6 const fixes.


# 91354 27-Feb-2002 dd

Introduce a version field to `struct xucred' in place of one of the
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being). Accordingly, change users of
xucred to set and check this field as appropriate. In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former. This also has the pleasant sideaffect of removing some
duplicate code.

Reviewed by: rwatson


# 90198 04-Feb-2002 ume

In tcp_respond(), correctly reset returned IPv6 header. This is essential
when the original packet contains an IPv6 extension header.

Obtained from: KAME
MFC after: 1 week


# 86764 22-Nov-2001 jlemon

Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs


# 86183 08-Nov-2001 rwatson

o Replace reference to 'struct proc' with 'struct thread' in 'struct
sysctl_req', which describes in-progress sysctl requests. This permits
sysctl handlers to have access to the current thread, permitting work
on implementing td->td_ucred, migration of suser() to using struct
thread to derive the appropriate ucred, and allowing struct thread to be
passed down to other code, such as network code where td is not currently
available (and curproc is used).

o Note: netncp and netsmb are not updated to reflect this change, as they
are not currently KSE-adapted.

Reviewed by: julian
Obtained from: TrustedBSD Project


# 84736 09-Oct-2001 rwatson

- Combine kern.ps_showallprocs and kern.ipc.showallsockets into
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().

Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.

Reviewed by: ps, billf
Obtained from: TrustedBSD Project


# 84527 05-Oct-2001 ps

Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks


# 83742 20-Sep-2001 rwatson

o Rename u_cansee() to cr_cansee(), making the name more comprehensible
in the face of a rename of ucred to cred, and possibly generally.

Obtained from: TrustedBSD Project


# 82122 21-Aug-2001 silby

Much delayed but now present: RFC 1948 style sequence numbers

In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.

Reviewed by: jesper


# 80429 26-Jul-2001 peter

Fix a warning.


# 80428 26-Jul-2001 peter

Patch up some style(9) stuff in tcp_new_isn()


# 80427 26-Jul-2001 peter

s/OpemBSD/OpenBSD/


# 79413 08-Jul-2001 silby

Temporary feature: Runtime tuneable tcp initial sequence number
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.

While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.

To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.

Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.

Reviewed by: jlemon
Tested by: numerous subscribers of -net


# 78697 24-Jun-2001 dwmalone

Allow getcred sysctl to work in jailed root processes. Processes can
only do getcred calls for sockets which were created in the same jail.
This should allow the ident to work in a reasonable way within jails.

PR: 28107
Approved by: des, rwatson


# 78671 23-Jun-2001 jlemon

Replace bzero() of struct ip with explicit zeroing of structure members,
which is faster.


# 78642 23-Jun-2001 silby

Eliminate the allocation of a tcp template structure for each
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.

Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks


# 78492 20-Jun-2001 ume

made sure to use the correct sa_len for rtalloc().
sizeof(ro_dst) is not necessarily the correct one.
this change would also fix the recent path MTU discovery problem for the
destination of an incoming TCP connection.

Submitted by: JINMEI Tatuya <jinmei@kame.net>
Obtained from: KAME
MFC after: 2 weeks


# 78100 11-Jun-2001 ume

This is force commit to mention about previous commit.

- do not assume that the ro_dst member of route_in6{} is sockaddr_in6.
- repair IPsec header size prediction.
- validate mbuf chain better in {tcp,udp}6_ctlinput.
- loosened validation inner packets of icmp6 errors as much as possible.
- scope-awareness.
- be friendly with pfctlinput2.
- simplified address scope handling in the ctlinput function.
- type change of in6_pcbnotify.
- pass error from ipsec_setsocket() all the way up.


# 78064 11-Jun-2001 ume

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 77900 08-Jun-2001 peter

"Fix" the previous initial attempt at fixing TUNABLE_INT(). This time
around, use a common function for looking up and extracting the tunables
from the kernel environment. This saves duplicating the same function
over and over again. This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.


# 77853 07-Jun-2001 peter

Back out part of my previous commit. This was a last minute change
and I botched testing. This is a perfect example of how NOT to do
this sort of thing. :-(


# 77843 06-Jun-2001 peter

Make the TUNABLE_*() macros look and behave more consistantly like the
SYSCTL_*() macros. TUNABLE_INT_DECL() was an odd name because it didn't
actually declare the int, which is what the name suggests it would do.


# 75733 20-Apr-2001 jesper

Say goodbye to TCP_COMPAT_42

Reviewed by: wollman
Requested by: wollman


# 75620 17-Apr-2001 kris

Note that the previous commit also restored some historical behaviour
in the TCP_COMPAT_42 case (e.g. choosing '1' as the initial sequence
number at boot-time, instead of randomizing it). TCP_COMPAT_42 is the
repository for old security holes, too :-)


# 75619 17-Apr-2001 kris

Randomize the TCP initial sequence numbers more thoroughly.

Obtained from: OpenBSD
Reviewed by: jesper, peter, -developers


# 74937 28-Mar-2001 jesper

MFC candidate.

Change code from PRC_UNREACH_ADMIN_PROHIB to PRC_UNREACH_PORT for
ICMP_UNREACH_PROTOCOL and ICMP_UNREACH_PORT

And let TCP treat PRC_UNREACH_PORT like PRC_UNREACH_ADMIN_PROHIB

This should fix the case where port unreachables for udp returned
ENETRESET instead of ECONNREFUSED

Problem found by: Bill Fenner <fenner@research.att.com>
Reviewed by: jlemon


# 74362 16-Mar-2001 phk

<sys/queue.h> makeover.


# 73109 26-Feb-2001 jlemon

Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly.

For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.

Clean up some layering violations (access to tcp internals from in_pcb)


# 72960 23-Feb-2001 jlemon

When converting soft error into a hard error, drop the connection. The
error will be passed up to the user, who will close the connection, so
it does not appear to make a sense to leave the connection open.

This also fixes a bug with kqueue, where the filter does not set EOF
on the connection, because the connection is still open.

Also remove calls to so{rw}wakeup, as we aren't doing anything with
them at the moment anyway.

Reviewed by: alfred, jesper


# 72959 23-Feb-2001 jlemon

Allow ICMP unreachables which map into PRC_UNREACH_ADMIN_PROHIB to
reset TCP connections which are in the SYN_SENT state, if the sequence
number in the echoed ICMP reply is correct. This behavior can be
controlled by the sysctl net.inet.tcp.icmp_may_rst.

Currently, only subtypes 2,3,10,11,12 are treated as such
(port, protocol and administrative unreachables).

Assocaiate an error code with these resets which is reported to the
user application: ENETRESET.

Disallow resetting TCP sessions which are not in a SYN_SENT state.

Reviewed by: jesper, -net


# 72922 22-Feb-2001 jesper

Redo the security update done in rev 1.54 of src/sys/netinet/tcp_subr.c
and 1.84 of src/sys/netinet/udp_usrreq.c

The changes broken down:

- remove 0 as a wildcard for addresses and port numbers in
src/sys/netinet/in_pcb.c:in_pcbnotify()
- add src/sys/netinet/in_pcb.c:in_pcbnotifyall() used to notify
all sessions with the specific remote address.
- change
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
to use in_pcbnotifyall() to notify multiple sessions, instead of
using in_pcbnotify() with 0 as src address and as port numbers.
- remove check for src port == 0 in
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
as they are no longer needed.
- move handling of redirects and host dead from in_pcbnotify() to
udp_ctlinput() and tcp_ctlinput(), so they will call
in_pcbnotifyall() to notify all sessions with the specific
remote address.

Approved by: jlemon
Inspired by: NetBSD


# 72778 20-Feb-2001 jesper

Only call in_pcbnotify if the src port number != 0, as we
treat 0 as a wildcard in src/sys/in_pbc.c:in_pcbnotify()

It's sufficient to check for src|local port, as we'll have no
sessions with src|local port == 0

Without this a attacker sending ICMP messages, where the attached
IP header (+ 8 bytes) has the address and port numbers == 0, would
have the ICMP message applied to all sessions.

PR: kern/25195
Submitted by: originally by jesper, reimplimented by jlemon's advice
Reviewed by: jlemon
Approved by: jlemon


# 72650 18-Feb-2001 green

Switch to using a struct xucred instead of a struct xucred when not
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by: bde


# 72638 18-Feb-2001 phk

Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify

Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h

Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input

In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB
or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG

Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst
to reflect the fact that we also react on ICMP unreachables that
are not administrative prohibited. Also update the comments to
reflect this.

In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat
PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different.

PR: 23986
Submitted by: Jesper Skriver <jesper@skriver.dk>


# 71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 70330 24-Dec-2000 phk

Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and
to be configurable with respect to acting only in SYN or in all TCP states.

PR: 23665
Submitted by: Jesper Skriver <jesper@skriver.dk>


# 70103 16-Dec-2000 phk

We currently does not react to ICMP administratively prohibited
messages send by routers when they deny our traffic, this causes
a timeout when trying to connect to TCP ports/services on a remote
host, which is blocked by routers or firewalls.

rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually
requi re that we treat such a message for a TCP session, that we
treat it like if we had recieved a RST.

quote begin.

A Destination Unreachable message that is received MUST be
reported to the transport layer. The transport layer SHOULD
use the information appropriately; for example, see Sections
4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol
that has its own mechanism for notifying the sender that a
port is unreachable (e.g., TCP, which sends RST segments)
MUST nevertheless accept an ICMP Port Unreachable for the
same purpose.

quote end.

I've written a small extension that implement this, it also create
a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if
this new behaviour is activated.

When it's activated (set to 1) we'll treat a ICMP administratively
prohibited message (icmp type 3 code 9, 10 and 13) for a TCP
sessions, as if we recived a TCP RST, but only if the TCP session
is in SYN_SENT state.

The reason for only reacting when in SYN_SENT state, is that this
will solve the problem, and at the same time minimize the risk of
this being abused.

I suggest that we enable this new behaviour by default, but it
would be a change of current behaviour, so if people prefer to
leave it disabled by default, at least for now, this would be ok
for me, the attached diff actually have the sysctl set to 0 by
default.

PR: 23086
Submitted by: Jesper Skriver <jesper@skriver.dk>


# 69147 25-Nov-2000 jlemon

Revert the last commit to the callout interface, and add a flag to
callout_init() indicating whether the callout is safe or not. Update
the callers of callout_init() to reflect the new interface.

Okayed by: Jake


# 67708 27-Oct-2000 phk

Convert all users of fldoff() to offsetof(). fldoff() is bad
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.

Define __offsetof() in <machine/ansi.h>

Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>

Remove myriad of local offsetof() definitions.

Remove includes of <stddef.h> in kernel code.

NB: Kernelcode should *never* include from /usr/include !

Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.

Deprecate <struct.h> with a warning. The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.

Paritials reviews by: various.
Significant brucifications by: bde


# 67456 23-Oct-2000 itojun

be careful on mbuf overrun on ctlinput.
short icmp6 packet may be able to panic the kernel.
sync with kame.


# 66433 28-Sep-2000 kris

Use stronger random number generation for TCP_ISSINCR and tcp_iss.

Reviewed by: peter, jlemon


# 66376 25-Sep-2000 bmilekic

Finally make do_tcpdrain sysctl live under correct parent, _net_inet_tcp,
as opposed to _debug. Like before, default value remains 1.


# 63745 21-Jul-2000 jayanth

When a connection is being dropped due to a listen queue overflow,
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.

Reviewed by: Peter Wemm,asmodai


# 62587 04-Jul-2000 itojun

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 62573 04-Jul-2000 phk

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


# 62454 03-Jul-2000 phk

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


# 59392 19-Apr-2000 shin

Let initialize th_sum before in6_cksum(), again.
Without this fix, all IPv6 TCP RST packet has wrong cksum value,
so IPv6 connect() trial to 5.0 machine won't fail until tcp connect timeout,
when they should fail soon.

Thanks to haro@tk.kubota.co.jp (Munehiro Matsuda) for his much debugging
help and detailed info.


# 58698 27-Mar-2000 jlemon

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


# 57576 28-Feb-2000 ps

Limit the maximum permissible TCP window size to 65535 octets if
window scaling is disabled.

PR: kern/16914
Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
Reviewed by: wollman
Approved by: jkh


# 56564 24-Jan-2000 shin

Fix the bug that IPv4 ttl is not initialized when AF_INET6 socket is used
for IPv4 communication.(IPv4 mapped IPv6 addr.)
Also removed IPv6 hoplimit initialization because it is alway done at
tcp_output.

Confirmed by: Bernd Walter <ticso@cicely5.cicely.de>


# 56041 15-Jan-2000 shin

Fixed the problem that IPsec connection hangs when bigger data is sent.
-opt_ipsec.h was missing on some tcp files (sorry for basic mistake)
-made buildable as above fix
-also added some missing IPv4 mapped IPv6 addr consideration into
ipsec4_getpolicybysock


# 56039 15-Jan-2000 shin

Added missing 'else' for 'if (isipv6)' at IPv6 length setting in tcp_respond().
By this bug, IPv6 reset was not sent.
(I checked around same kind of bug, but no other found.)


# 56019 15-Jan-2000 shin

Removed wrong(unnecessary) & operators for pointer, in ipsec_hdrsiz_tcp().
This must be one of the reason why connections over IPsec hangs for
bigger packets.(which was reported on freebsd-current@freebsd.org)

But there still seems to be another bug and the problem is not yet fixed.


# 55913 13-Jan-2000 shin

Clear rt after RTFREE. This might have sometime caused kernel panic at rtfree()
on INET6 enabled environment.


# 55874 13-Jan-2000 shin

removed incorrect ip6 length setting for IPv6 tcp reset packet.


# 55679 09-Jan-2000 shin

tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 55198 28-Dec-1999 msmith

Make tcp_drain() actually do something. When invoked (usually as a
desperation measure in low-memory situations), walk the tcpbs and
flush the reassembly queues.

This behaviour is currently controlled by the debug.do_tcpdrain sysctl
(defaults to on).

Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: wollman


# 54263 07-Dec-1999 shin

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translater daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 53541 22-Nov-1999 shin

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 52904 05-Nov-1999 shin

KAME related header files additions and merges.
(only those which don't affect c source files so much)

Reviewed by: cvs-committers
Obtained from: KAME project


# 51381 19-Sep-1999 green

Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.


# 50673 30-Aug-1999 jlemon

Restructure TCP timeout handling:

- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.

Submitted by: jlemon, wollmann


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 50426 26-Aug-1999 jlemon

Add readonly OID ``net.inet.tcp.tcbhashsize'' so it is possible to
discover the size of the TCB hashtable on a running system.


# 48758 11-Jul-1999 green

Two new sysctls: net.inet.tcp.getcred and net.inet.udp.getcred. These take
a sockaddr_in[2] (local, then remote) and return a struct ucred. Example
code for these is at:
http://www.FreeBSD.org/~green/inetd_ident.patch
http://www.FreeBSD.org/~green/freebsd4.c (for pidentd)

Reviewed by: bde


# 48578 05-Jul-1999 msmith

Use the new tunable macros for the net.inet.tcp.tcbhashsize tunable.


# 47960 16-Jun-1999 tegge

Close a race window where a tcp socket is closed while tcp_pcblist is
copying out tcp socket info, causing a NULL pointer to be dereferenced.


# 46381 03-May-1999 billf

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 46155 28-Apr-1999 phk

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# 43576 04-Feb-1999 msmith

Nuke all the stupid ffs() stuff and use powerof2() instead.
Submitted by: Bruce Evans <bde@zeta.org.au>


# 43575 04-Feb-1999 msmith

Fix power-of-2 check for the TCB hash size.

Submitted by: Brian Feldman <green@unixhelp.org>


# 43562 03-Feb-1999 msmith

Make TCBHASHSIZE a boot-time tunable as well, taking its value from the
variable net.inet.tcp.tcbhashsize.

Requested by: David Filo <filo@yahoo-inc.com>


# 41591 07-Dec-1998 archie

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


# 41187 15-Nov-1998 guido

The below patch helps to reduce the leakage of internal socket information
when a TCP "stealth" scan is directed at a *BSD box by ensuring the window
is 0 for all RST packets generated through tcp_respond()
Reviewed by: Don Lewis <Don.Lewis@tsc.tdk.com>
Obtained from: Bugtraq (from: Darren Reed <avalon@COOMBS.ANU.EDU.AU>)


# 38875 06-Sep-1998 phk

RFC 1644 has the status "Experimental Protocol", which means:

4.1.4. Experimental Protocol

A system should not implement an experimental protocol unless it
is participating in the experiment and has coordinated its use of
the protocol with the developer of the protocol.

Pointed out by: Steinar Haug <sthaug@nethelp.no>


# 38513 24-Aug-1998 dfr

Re-implement tcp and ip fragment reassembly to not store pointers in the
ip header which can't work on alpha since pointers are too big.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


# 36079 15-May-1998 wollman

Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).


# 34923 28-Mar-1998 bde

Fixed style bugs (mostly) in previous commit.


# 34881 24-Mar-1998 wollman

Use the zone allocator to allocate inpcbs and tcpcbs. Each protocol creates
its own zone; this is used particularly by TCP which allocates both inpcb and
tcpcb in a single allocation. (Some hackery ensures that the tcpcb is
reasonably aligned.) Also keep track of the number of pcbs of each type
allocated, and keep a generation count (instance version number) for future
use.


# 32821 27-Jan-1998 dg

Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).


# 32752 25-Jan-1998 eivind

Make TCP_COMPAT_42 a new style option.


# 31848 19-Dec-1997 julian

Fix an incredibly horrible bug in the ipfw code
where if you are using the "reset tcp" firewall command,
the kernel would write ethernet headers onto random kernel stack locations.

Fought to the death by: terry, julian, archie.
fix valid for 2.2 series as well.


# 30813 28-Oct-1997 bde

Removed unused #includes.


# 29514 16-Sep-1997 joerg

Make TCPDEBUG a new-style option.


# 29506 16-Sep-1997 bde

Fixed gratuitous ANSIisms.


# 24570 03-Apr-1997 dg

Reorganize elements of the inpcb struct to take better advantage of
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).

Discussed-with: wollman


# 23324 03-Mar-1997 dg

Improved performance of hash algorithm while (hopefully) not reducing
the quality of the hash distribution. This does not fix a problem dealing
with poor distribution when using lots of IP aliases and listening
on the same port on every one of them...some other day perhaps; fixing
that requires significant code changes.
The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22719 14-Feb-1997 wollman

Fix the mechanism for choosing wehether to save the slow-start threshold
in the route. This allows us to remove the unconditional setting of the
pipesize in the route, which should mean that SO_SNDBUF and SO_RCVBUF
should actually work again. While we're at it:

- Convert udp_usrreq from `mondo switch statement from Hell' to new-style.
- Delete old TCP mondo switch statement from Hell, which had previously
been diked out.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 17269 24-Jul-1996 wollman

Eliminate some more references to separate ip_v and ip_hl fields.


# 16367 14-Jun-1996 wollman

Better selection of initial retransmit timeout when no cached
RTT information is available.

Submitted by: kbracey@art.acorn.co.uk (Kevin Bracey)
(slightly modified by me)


# 16141 05-Jun-1996 wollman

Correct formula for TCP RTO calculation. Also try to do a better job in
filling in a new PCB's rttvar (but this is not the last word on the subject).
And get rid of `#ifdef RTV_RTT', it's been true for four years now...


# 14841 27-Mar-1996 wollman

In tcp_respond(), check that ro->ro_rt is non-null before RTFREEing
it.


# 14754 22-Mar-1996 wollman

Make sure tcp_respond() always calls ip_output() with a valid
route pointer. This has no effect in the current ip_output(),
but my version requires that ip_output() always be passed a route.


# 14546 11-Mar-1996 dg

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# 12939 20-Dec-1995 wollman

Fix a nagging divide-by-zero error resulting from the MTU discovery code
getting triggered at a bad time.


# 12881 16-Dec-1995 bde

Uniformized pr_ctlinput protosw functions. The third arg is now `void
*' instead of caddr_t and it isn't optional (it never was). Most of the
netipx (and netns) pr_ctlinput functions abuse the second arg instead of
using the third arg but fixing this is beyond the scope of this round
of changes.


# 12635 05-Dec-1995 wollman

Path MTU Discovery is now standard.


# 12296 14-Nov-1995 phk

New style sysctl & staticize alot of stuff.


# 12172 09-Nov-1995 phk

Start adding new style sysctl here too.


# 11537 16-Oct-1995 wollman

The ability to administratively change the MTU of an interface presents
a few new wrinkles for MTU discovery which tcp_output() had better
be prepared to handle. ip_output() is also modified to do something
helpful in this case, since it has already calculated the information
we need.


# 11450 12-Oct-1995 wollman

The additional checks involving sequence numbers in MTU discovery resends
turned out not to be necessary; simply watching for MTU decreases (which
we already did) automagically eliminates all the cases we were trying to
protect against.


# 11415 10-Oct-1995 wollman

More MTU discovery: avoid over-retransmission if route changes in the
middle of a fully-open window. Also, keep track of how many retransmits
we do as a result of MTU discovery. This may actually do more work than
necessary, but it's an unusual condition...

Suggested by: Janey Hoe <janey@lcs.mit.edu>


# 11150 03-Oct-1995 wollman

Finish 4.4-Lite-2 merge: randomize TCP initial sequence numbers
to make ISS-guessing spoofing attacks harder.


# 10956 22-Sep-1995 wollman

Correct spelling error in MTUDISC code.


# 10930 20-Sep-1995 wollman

Add support in TCP for Path MTU discovery. This is highly experimental
and gated on `options MTUDISC' in the source. It is also practically
untested becausse (sniff!) I don't have easy access to a network with
an MTU of less than an Ethernet. If you have a small MTU network,
please try it and tell me if it works!


# 10881 18-Sep-1995 wollman

Initial back-end support for IP MTU discovery, gated on MTUDISC. The support
for TCP has yet to be written.


# 9373 29-Jun-1995 wollman

Keep track of the number of samples through the srtt filter so that we
know better when to cache values in the route, rather than relying on a
heuristic involving sequence numbers that broke when tcp_sendspace
was increased to 16k.


# 9263 19-Jun-1995 wollman

Now that we've gone to all sorts of effort to allow TCP to cache some of
its connection parameters, we want to keep statistics on how often this
actually happens to see whether there is any work that needs to be done in
TCP itself.

Suggested by: John Wroclawski <jtw@lcs.mit.edu>


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 7684 08-Apr-1995 dg

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 6922 06-Mar-1995 nate

Removed unnecessary define for TCPOUTFLAGS since they are not used.


# 6475 15-Feb-1995 wollman

Transaction TCP support now standard. Hack away!


# 6283 09-Feb-1995 wollman

Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and
Bob Braden <braden@isi.edu>.

NB: This has not had David's TCP ACK hack re-integrated. It is not clear
what the correct solution to this problem is, if any. If a better solution
doesn't pop up in response to this message, I'll put David's code back in
(or he's welcome to do so himself).


# 3444 08-Oct-1994 phk

Cosmetics: silences gcc -Wall.


# 3311 02-Oct-1994 phk

GCC cleanup.
Reviewed by:
Submitted by:
Obtained from:


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources