History log of /freebsd-current/sys/netinet/tcp_usrreq.c
Revision Date Author Comments
# e7381521 30-May-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: remove unused code in tcp_usr_attach

pr_attach is only called on a socket (so) with so->so_listen != NULL
via sonewconn. However, sonewconn is not called from the TCP code.
The listening sockets are handled in tcp_syncache.c without using
sonewconn. Therefore, the code removed is never executed.
No functional change intended.

Reviewed by: rrs, peter.lei_ieee.org
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45412


# fe136aec 23-May-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve inp locking in setsockopt

Ensure that the inp is not dropped when starting a stack switch.
While there, clean-up the code by using INP_WLOCK_RECHECK, which
also re-assigns tp.

Reviewed by: glebius
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45241


# fce03f85 05-May-2024 Randall Stewart <rrs@FreeBSD.org>

TCP can be subject to Sack Attacks lets fix this issue.

There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:

ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends

ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.

To combat this we introduce three things.

when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.
The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).

After this set of changes is in rack can drop its SAD detection completely

Reviewed by:tuexen@, rscheff@
Differential Revision: <https://reviews.freebsd.org/D44903>


# dd7b86e2 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove IS_FASTOPEN() macro

The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.

Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362


# 85df11a1 12-Mar-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

ktls: deep copy tls_enable struct for in-kernel tcp consumers

Doing a deep copy of the keys early allows users of the
tls_enable structure to assume kernel memory.
This enables the socket options to be set by kernel threads.

Reviewed By: #transport, tuexen, jhb, rrs
Sponsored by: NetApp, Inc.
X-NetApp-PR: #79
Differential Revision: https://reviews.freebsd.org/D44250


# e18b97bd 12-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

With a bit of a struggle with git I finally got rack_pcm.c into place (apologies
for not noticing this error). The LINT kernel is running on my box now .. sigh.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# c112243f 11-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

Revert "Update to bring the rack stack with all its fixes in."

This commit was incomplete and breaks LINT kernels. The tree has been
broken for 8+ hours.

This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.


# f6d489f4 11-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

Note there is a new file that I can't figure out how to get in rack_pcm.c

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# abe8379b 15-Feb-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: repair wakeup of accept(2) by shutdown(2)

That was lost in transition from one-for-all soshutdown() to protocol
specific methods. Only protocols that listen(2) were affected. This is
not a documented or specified feature, but some software relies on it. At
least the FreeSWITCH telephony software uses this behavior on
PF_INET/SOCK_STREAM.

Fixes: 5bba2728079ed4da33f727dbc2b6ae1de02ba897


# 3eeb22cb 10-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: clean scoreboard when releasing the socket buffer

The SACK scoreboard is conceptually an extention of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Verify that this is also the expected
state in tcp_discardcb().

PR: 276761
Reviewed by: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43805


# ce69e373 03-Feb-2024 Gleb Smirnoff <glebius@FreeBSD.org>

Revert "sockets: retire sorflush()"

Provide a comment in sorflush() why the socket I/O sx(9) lock is actually
important.

This reverts commit 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3.


# 507f87a7 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: retire sorflush()

With removal of dom_dispose method the function boils down to two
meaningful function calls: socantrcvmore() and sbrelease(). The latter is
only relevant for protocols that use generic socket buffers.

The socket I/O sx(9) lock acquisition in sorflush() is not relevant for
shutdown(2) operation as it doesn't do any I/O that may interleave with
read(2) or write(2). The socket buffer mutex acquisition inside
sbrelease() is what guarantees thread safety. This sx(9) acquisition in
soshutdown() can be tracked down to 4.4BSD times, where it used to be
sblock(), and it was carried over through the years evolving together with
sockets with no reconsideration of why do we carry it over. I can't tell
if that sblock() made sense back then, but it doesn't make any today.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43415


# 5bba2728 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: make pr_shutdown fully protocol specific method

Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more
self documented, as protocol specific method would execute only the code
that is relevant to that protocol and nothing else. This also fixes a
couple recent regressions and reduces risk of future regressions. The
extended KPI for the new pr_shutdown removes need for the extra pr_flush
which was added for the sake of SCTP which could not perform its shutdown
properly with the old one. Particularly for SCTP this change streamlines
a lot of code.

Some notes on why certain parts of code were copied or were not to certain
protocols:
* The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is
needed only for those protocols that may be connected or disconnected.
* The above reduces into only SS_ISCONNECTED for those protocols that
always connect instantly.
* The ENOTCONN and continue processing hack is left only for datagram
protocols.
* The SOLISTENING(so) block is copied to those protocols that listen(2).
* sorflush() on SHUT_RD is copied almost to every protocol, but that
will be refactored later.
* wakeup(&so->so_timeo) is copied to protocols that can make a non-instant
connect(2), can SO_LINGER or can accept(2).

There are three protocols (netgraph(4), Bluetooth, SDP) that did not have
pr_shutdown, but old soshutdown() would still perform sorflush() on
SHUT_RD for them and also wakeup(9). Those protocols partially supported
shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost
shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay
about that and SDP is almost abandoned anyway.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43413


# a13039e2 27-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: reoder inpcb destruction

First, merge in_pcbdetach() with in_pcbfree(). The comment for
in_pcbdetach() was no longer correct. Then, make sure we remove
the inpcb from the hash before we commit any destructive actions
on it. There are couple functions that rely on the hash lock
skipping SMR + inpcb lock to lookup an inpcb. Although there are
no known functions that similarly rely on the global inpcb list
lock, also do list removal before destructive actions.

PR: 273890
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D43122


# d2ef52ef 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp/hpts: make stacks responsible for clearing themselves out HPTS

There already is the tfb_tcp_timer_stop_all method that is supposed to stop
all time events associated with a given tcpcb by given stack. Some time
ago it was doing actual callout_stop(). Today bbr/rack just mark their
internal state as inactive in their tfb_tcp_timer_stop_all methods, but
tcpcb stays in HPTS wheel and potentially called in from HPTS. Change the
methods to also call tcp_hpts_remove(). Note: I'm not sure if internal
flag is still relevant once we are out of HPTS wheel.

Call the method when connection goes into TCP_CLOSED state, instead of
calling it later when tcpcb is freed. Also call it when we switch between
stacks.

Reviewed by: tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42857


# f42518ff 30-Nov-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: for LRD move sysctl from tcp.do_lrd tp tcp.sack.lrd, remove sockopt

Moving lrd sysctl to the tcp.sack branch, since LRD only works with SACK.
Remove the sockopt to programmatically control LRD per session.

Reviewed By: #transport, tuexen, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42851


# cfb1e929 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on accept(2)

Let the accept functions provide stack memory for protocols to fill it in.
Generic code should provide sockaddr_storage, specialized code may provide
smaller structure.

While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting
required length in case if provided length was insufficient. Our manual
page accept(2) and POSIX don't explicitly require that, but one can read
the text as they do. Linux also does that. Update tests accordingly.

Reviewed by: rscheff, tuexen, zlei, dchagin
Differential Revision: https://reviews.freebsd.org/D42635


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 70e30add 16-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove extraneous network epoch entry

accept(2) on IPv6 TCP doesn't need epoch. Some leaf functions may
need it, but they will enter accordingly, see sa6_recoverscope().

Reviewed by: rscheff, tuexen (implicitly, see deleted XXXMT)
Differential Revision: https://reviews.freebsd.org/D42634


# dc485b96 22-Aug-2023 Marius Strobl <marius@FreeBSD.org>

tcp_info: Add and export more FreeBSD-specific fields

This change adds struct tcp_info fields corresponding to the following
struct tcpcb ones:
- snd_una
- snd_max
- rcv_numsacks
- rcv_adv
- dupacks

Note that while both tcp_fill_info() and fill_tcp_info_from_tcb() are
extended accordingly, no counterpart of rcv_numsacks is available in
the cxgbe(4) TOE PCB, though.

Sponsored by: NetApp, Inc. (originally)


# 8c6104c4 22-Aug-2023 Marius Strobl <marius@FreeBSD.org>

tcp_fill_info(): Change lock assertion on INPCB to locked only

This function actually only ever reads from the TCP PCB. Consequently,
also make the pointer to its TCP PCB parameter const.

Sponsored by: NetApp, Inc. (originally)


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# de0a2eb2 23-Jun-2023 Mark Johnston <markj@FreeBSD.org>

tcp: Disallow connecting a disconnected socket

Currently nothing prevents tcp_usr_connect() from attempting to connect
when the socket has been disconnected. At the moment, doing so triggers
an assertion in in_pcbconnect() because inp_faddr is not unspecified. I
believe this may have been caught in the past by TIMEWAIT checks, but
those are now removed.

Check for additional socket states in tcp_connect().

Reported by: syzbot+f0f7871ec5397602b446@syzkaller.appspotmail.com
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D40579


# 04682968 20-Jun-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: expose AccECN mode and TCP FastOpen (TFO) in TCPI

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D40621


# c2399dd2 06-May-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve BBLoging for PRUs

Log all errors for PRUs, except when INP_DROPPED is set. In that case,
don't log it.

Reviewed by: glebius, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D39591


# c2a69e84 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: move HPTS related fields from inpcb to tcpcb

This makes inpcb lighter and allows future cache line optimizations
of tcpcb. The reason why HPTS originally used inpcb is the compressed
TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the
associated connection is still on the HPTS ring.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39697


# 66fbc19f 07-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: pass tcpcb in the tfb_tcp_ctloutput() method instead of inpcb

Just matches rest of the KPI.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39435


# 945f9a7c 07-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: misc cleanup of options for rack as well as socket option logging.

Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack
has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also
there is a t_maxpeak_rate that needs to be removed (its un-used).

Reviewed by: tuexen, cc
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39427


# 73ee5756 31-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.

So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210


# 69c7c811 16-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.

The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831


# 399a5655 28-Feb-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Make TCP PCAP buffer properly configurable.

Reviewed By: tuexen, cc, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D38824


# 453aa7fa 22-Feb-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: ensure the tcpcb is not NULL when logging an event

When calling tcp_bblog_pru() on some error paths, tp is NULL,
therefore handle it.

Sponsored by: Netflix, Inc.


# 4065becf 21-Feb-2023 Michael Tuexen <tuexen@FreeBSD.org>

bblog: unbreak build

Ensure that tp is always declared and set.

Reported by: Michael Butler
Sponsored by: Netflix, Inc.


# 00812bbd 20-Feb-2023 Michael Tuexen <tuexen@FreeBSD.org>

bblog: add logging of protocol user requests

This information was available in trpt and is useful. So provide
a way to get this information via TCP BBLog.

Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38701


# cda6bdba 17-Feb-2023 John Baldwin <jhb@FreeBSD.org>

tcp: Don't try to disconnect a socket multiple times.

When the checks for INP_TIMEWAIT were removed, tcp_usr_close() and
tcp_usr_disconnect() were no longer prevented from calling
tcp_disconnect() on a socket that was already disconnected. This
triggered a panic in cxgbe(4) for TOE where the tcp_disconnect() on an
already-disconnected socket invoked tcp_output() on a socket that was
already in time-wait.

Reviewed by: rrs, np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37112


# 96871af0 15-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: use family specific sockaddr argument for bind functions

Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the
protocol's pr_bind method and from there on go down the call
stack with family specific argument.

Reviewed by: zlei, melifaro, markj
Differential Revision: https://reviews.freebsd.org/D38601


# 636b19ea 14-Feb-2023 Mark Johnston <markj@FreeBSD.org>

tcp: Disallow re-connection of a connected socket

soconnectat() tries to ensure that one cannot connect a connected
socket. However, the check is racy and does not really prevent two
threads from attempting to connect the same TCP socket.

Modify tcp_connect() and tcp6_connect() to perform the check again, this
time synchronized by the inpcb lock, under which we call
soisconnecting().

Reported by: syzkaller
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38507


# 775da7f8 13-Feb-2023 Mark Johnston <markj@FreeBSD.org>

tcp: Remove a redundant net_epoch entry in tcp6_connect()

tcp6_connect() is always called in a net_epoch read section.

Fixes: 3d76be28ec60 ("netinet6: require network epoch for in6_pcbconnect()")
Reviewed by: tuexen, glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38506


# dfc4d218 07-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use straight in_pcbconnect() in tcp_connect()

This brings tcp_connect() par with tcp6_connect(). The code removed
now is a remnant of "truncating old TIME-WAIT" removed back in 2004
in c94c54e4df9a.

Reviewed by: markj, tuexen
Differential Revision: https://reviews.freebsd.org/D38405


# fb8f221a 06-Feb-2023 Maxim Konovalov <maxim@FreeBSD.org>

db_printf: fix a typo

PR: 269377


# 9e46ff4d 03-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

netinet: don't return conflicting inpcb in in_pcbconnect_setup()

Last time this inpcb was actually used was in tcp_connect()
before c94c54e4df9a.


# a9afe086 03-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: bring comment for tcp_connect() up to date

We no longer use in_pcbbind() since 25102351509. The comment about
truncating old TIME-WAIT describes a code that had been removed back
in 2004 in c94c54e4df9a.


# a9d22cce 03-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: use family specific sockaddr argument for connect functions

Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the
protocol's pr_connect method and from there on go down the call
stack with family specific argument.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D38356


# 3d76be28 03-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

netinet6: require network epoch for in6_pcbconnect()

This removes recursive epoch entry in the syncache case. Fixes
unprotected access to V_in6_ifaddrhead in in6_pcbladdr(), as
well as access to prison IP address lists. It also matches what
IPv4 in_pcbconnect() does.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D38355


# 221b9e3d 03-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: merge two versions of in6_pcbconnect() into one

No functional change.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D38354


# 76f1499f 03-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: retire net.inet.tcp.tcp_require_unique_port

It was a safe belt just in case if the new port allocation
behaviour introduced in 25102351509 would cause a problem.

Reviewed by: markj, rscheff, tuexen
Differential revision: https://reviews.freebsd.org/D38353


# 18b83b62 26-Jan-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: reduce the size of t_rttupdated in tcpcb

During tcp session start, various mechanisms need to
track a few initial RTTs before becoming active.
Prevent overflows of the corresponding tracking counter
and reduce the size of tcpcb simultaneously.

Reviewed By: #transport, tuexen, guest-ccui
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21117


# eaabc937 14-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: retire TCPDEBUG

This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.

We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket. Also the tcp::debug DTrace probes look at
this flag on a socket.

Reviewed by: gnn, tuexen
Discussed with: rscheff, rrs, jtl
Differential revision: https://reviews.freebsd.org/D37694


# 446ccdd0 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use single locked callout per tcpcb for the TCP timers

Use only one callout structure per tcpcb that is responsible for handling
all five TCP timeouts. Use locked version of callout, of course. The
callout function tcp_timer_enter() chooses soonest timer and executes it
with lock held. Unless the timer reports that the tcpcb has been freed,
the callout is rescheduled for next soonest timer, if there is any.

With single callout per tcpcb on connection teardown we should be able
to fully stop the callout and immediately free it, avoiding use of
callout_async_drain(). There is one gotcha here: callout_stop() can
actually touch our memory when a rare race condition happens. See
comment above tcp_timer_stop(). Synchronous stop of the callout makes
tcp_discardcb() the single entry point for tcpcb destructor, merging the
tcp_freecb() to the end of the function.

While here, also remove lots of lingering checks in the beginning of
TCP timer functions. With a locked callout they are unnecessary.

While here, clean unused parts of timer KPI for the pluggable TCP stacks.

While here, remove TCPDEBUG from tcp_timer.c, as this allows for more
simplification of TCP timers. The TCPDEBUG is scheduled for removal.

Move the DTrace probes in timers to the beginning of a function, where
a tcpcb is always existing.

Discussed with: rrs, tuexen, rscheff (the TCP part of the diff)
Reviewed by: hselasky, kib, mav (the callout part)
Differential revision: https://reviews.freebsd.org/D37321


# e68b3792 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: embed inpcb into tcpcb

For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.

The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.

The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.

One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.

Large part of the change was done with sed script:

s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g

Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.

Differential revision: https://reviews.freebsd.org/D37127


# bd4f9866 16-Nov-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: remove unused t_rttbest

No functional change intended.

Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D37401


# 9eb0e832 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: provide macros to access inpcb and socket from a tcpcb

There should be no functional changes with this commit.

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37123


# 22c81cc5 06-Nov-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: add AccECN CE packet counters to tcpinfo

Provide diagnostics information around AccECN into
the tcpinfo struct.

Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37280


# 53af6903 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove INP_TIMEWAIT flag

Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193ab, this commit shall not cause any functional changes.

Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.

Differential revision: https://reviews.freebsd.org/D36400


# 9c3507f9 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: in tcp_usr_detach() remove special handling of compressed time-wait

Differential revision: https://reviews.freebsd.org/D36399


# 08af8aac 27-Sep-2022 Randall Stewart <rrs@FreeBSD.org>

Tcp progress timeout

Rack has had the ability to timeout connections that just sit idle automatically. This
feature of course is off by default and requires the user set it on (though the socket option
has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in
the base stack as well as rack.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36716


# 493105c2 21-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: fix simultaneous open and refine e80062a2d43

- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
is also necessary for a half-synchronized connection. Fix that
just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.

Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet. For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete. The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup(). Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case. However, this doesn't yet work.

Reviewed by: rscheff, tuexen, rrs
Differential revision: https://reviews.freebsd.org/D36641


# e80062a2 08-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: avoid call to soisconnected() on transition to ESTABLISHED

This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place. After 6f3caa6d815 it definitely
became necessary always and commit message from f1ee30ccd60 confirms that.
Now that 6f3caa6d815 is effectively backed out by 07285bb4c22, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.

Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36488


# 0773b44e 05-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: tcp6_connect() requires net epoch

PR: 262663
Reported & tested by: dch
MFC after: 2 weeks


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# d9f6ac88 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: retire PRU_ flags and their char names

For many years only TCP debugging used them, but relatively recently
TCP DTrace probes also start to use them. Move their declarations
into tcp_debug.h, but start including tcp_debug.h unconditionally,
so that compilation with DTrace and without TCPDEBUG is possible.


# 07285bb4 10-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: utilize new solisten_clone() and solisten_enqueue()

This streamlines cloning of a socket from a listener. Now we do not
drop the inpcb lock during creation of a new socket, do not do useless
state transitions, and put a fully initialized socket+inpcb+tcpcb into
the listen queue.

Before this change, first we would allocate the socket and inpcb+tcpcb via
tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock
pcb and put this onto incomplete queue (see 6f3caa6d815). Then, after
sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED,
insert into inpcb hash, finalize initialization of tcpcb. And then, in
call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call
soisconnected(). This call would lock the listening socket once again
with a LOR protection sequence and then we would relocate the socket onto
the complete queue and only now it is ready for accept(2).

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36064


# c7a62c92 10-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: gather v4/v6 handling code into in_pcballoc() from protocols

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36062


# 1b91978f 06-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove a condition in tcp_usr_detach() that never happens

The comment from Robert Watson doubts that this condition ever happens.
Our analysis confirm that. Also, we found that if you manage to create
such a connection with help of some other bug, then after the "second
case" code is executed, the kernel will panic in other part of the stack.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D35714


# d8596171 04-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: use only soref()/sorele() as socket reference count

o Retire SS_FDREF as it is basically a debug flag on top of already
existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private. The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().

Note on listening queues. Until now the sockets on a queue had zero
reference count. And the reference were given only upon accept(2). The
assumption was that there is no way to see the queued socket from anywhere
except its head. This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists. With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.

Differential revision: https://reviews.freebsd.org/D35679


# 74703901 04-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use a TCP flag to check if connection has been close(2)d

The flag SS_NOFDREF is a private flag of the socket layer. It also
is supposed to be read with SOCK_LOCK(), which we don't own here.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D35663


# 97453e5e 23-Jun-2022 Claudio Jeker <claudio@openbsd.org>

Unlock inp when handling TCP_MD5SIG socket options

Unlock the inp when hanlding TCP_MD5SIG socket options. tcp_ipsec_pcbctl
handles locking the inp when the option is being modified.

This was found by Claudio Jeker while working on the OpenBGPd port.

On 14 we get a panic when trying to call getsockopt, on 13.1 the process
locks up using 100% CPU.

Reviewed by: rscheff (transport), tuexen
MFC after: 3 days
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D35532


# b338b1fd 18-Apr-2022 Mateusz Guzik <mjg@FreeBSD.org>

tcp: plug set-but-not-used vars

Sponsored by: Rubicon Communications, LLC ("Netgate")


# ea9017fb 21-Feb-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Congestion control move to using reference counting.

In the transport call on 12/3 Gleb asked to move the CC modules towards
using reference counting to prevent folks from unloading a module in use.
It was also agreed that Michael would do a user space utility like tcp_drop
that could be used to move all connections that are using a specific CC
to some other CC.

This is the half I committed to doing, making it so that we maintain a refcount
on a cc module every time a pcb refers to it and decrementing that every
time a pcb no longer uses a cc module. This also helps us simplify the
whole unloading process by getting rid of tcp_ccunload() which munged
through all the tcb's. Instead we mark a module as being removed and
prevent further references to it. We also make sure that if a module is
marked as being removed it cannot be made as the default and also
the opposite of that, if its a default it fails and does not mark it as being
removed.

Reviewed by: Michael Tuexen, Gleb Smirnoff
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33249


# 3f169c54 09-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Add/update AccECN related statistics and numbers

Reserve couters in the tcps struct in preparation
for AccECN, extend the debugging output for TF2
flags, optimize the syncache flags from individual
bits to a codepoint for the specifc ECN handshake.

This is in preparation of AccECN.

No functional chance except for extended debug
output capabilities.

Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34161


# 528c7649 08-Feb-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix compliation when KERN_TLS is not defined

Reported by: Gary Jennejohn
Fixes: fd7daa727126 - main - tcp: make tcp_ctloutput_set() non-static
Sponsored by: Netflix, Inc.


# fd7daa72 08-Feb-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: make tcp_ctloutput_set() non-static

tcp_ctloutput_set() will be used via the sysctl interface in a
upcoming command line tool tcpsso.

Reviewed by: glebius, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34164


# 3b0ee680 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Prevent setting of ECN bits with setsockopt()

setsockopt() grants full access to the deprecated
TOS byte. For TCP, mask out the ECN codepoint, so that
only the DSCP portion can be adjusted.

Reviewed By: tuexen, hselasky, #manpages, #transport, debdrup
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34154


# 3b3c08c1 02-Feb-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: cleanup functions related to socket option handling

Consistently only pass the inp and the sopt around. Don't pass the
so around, since in a upcoming commit tcp_ctloutput_set() will be
called from a context different from setsockopt(). Also expect
the inp to be locked when calling tcp_ctloutput_[gs]et(), this is
also required for the upcoming use by tcpsso, a command line tool
to set socket options.
Reviewed by: glebius, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34151


# aac52f94 18-Jan-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Warning cleanup from new compiler.

The clang compiler recently got an update that generates warnings of unused
variables where they were set, and then never used. This revision goes through
the tcp stack and cleans all of those up.

Reviewed by: Michael Tuexen, Gleb Smirnoff
Sponsored by: Netflix Inc.
Differential Revision:


# 1d41a494 13-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_usr_connect: report actual error code when stack requests drop


# 4287aa56 28-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_usr_shutdown: don't cast inp_ppcb to tcpcb before checking inp_flags

While here move out one more erroneous condition out of the epoch and
common return. The only functional change is that if we send control
on a shut down socket we would get EINVAL instead of ECONNRESET.

Reviewed by: tuexen
Reported by: syzbot+8388cf7f401a7b6bece6@syzkaller.appspotmail.com
Fixes: f64dc2ab5be38e5366271ef85ea90d8cb1c7841a


# 0af4ce45 27-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_usr_shutdown: don't cast inp_ppcb to tcpcb before checking inp_flags

Fixes: f64dc2ab5be38e5366271ef85ea90d8cb1c7841a


# 37a7f557 27-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_usr_rcvd: don't cast inp_ppcb to tcpcb before checking inp_flags

Fixes: f64dc2ab5be38e5366271ef85ea90d8cb1c7841a


# a370832b 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove delayed drop KPI

No longer needed after tcp_output() can ask caller to drop.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33371


# f64dc2ab 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: TCP output method can request tcp_drop

The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection
when they do output on it. The default stack never does this, thus
existing framework expects tcp_output() always to return locked and
valid tcpcb.

Provide KPI extension to satisfy demands of advanced stacks. If the
output method returns negative error code, it means that caller must
call tcp_drop().

In tcp_var() provide three inline methods to call tcp_output():
- tcp_output() is a drop-in replacement for the default stack, so that
default stack can continue using it internally without modifications.
For advanced stacks it would perform tcp_drop() and unlock and report
that with negative error code.
- tcp_output_unlock() handles the negative code and always converts
it to positive and always unlocks.
- tcp_output_nodrop() just calls the method and leaves the responsibility
to drop on the caller.

Sweep over the advanced stacks and use new KPI instead of using HPTS
delayed drop queue for that.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33370


# 40fa3e40 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: mechanically substitute call to tfb_tcp_output to new method.

Made with sed(1) execution:

sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/)

sed:
s/tp->t_fb->tfb_tcp_output\(tp\)/tcp_output(tp)/
s/to tfb_tcp_output\(\)/to tcp_output()/

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33366


# ef396441 10-Nov-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_usr_detach: revert debugging piece from f5cf1e5f5a500.

The code was probably useful during the problem being chased down,
but for brevity makes sense just to return to the original KASSERT.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D32968


# df07bfda 12-Nov-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: Fix a locking issue

INP_WLOCK_RECHECK_CLEANUP() and INP_WLOCK_RECHECK() might return
from the function, so any locks held must be released.

Reported by: syzbot+b1a888df08efaa7b4bf1@syzkaller.appspotmail.com
Reviewed by: markj
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D32975


# b8d60729 11-Nov-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Congestion control cleanup.

NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!!

This patch does a bit of cleanup on TCP congestion control modules. There were some rather
interesting surprises that one could get i.e. where you use a socket option to change
from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get
a memory failure and end up on cc_newreno. This is not what one would expect. The
new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK
and pass in to the init function preallocated memory. The CC init is expected in this
case *not* to fail but if it does and a module does break the
"no fail with memory given" contract we do fall back to the CC that was in place at the time.

This also fixes up a set of common newreno utilities that can be shared amongst other
CC modules instead of the other CC modules reaching into newreno and executing
what they think is a "common and understood" function. Lets put these functions in
cc.c and that way we have a common place that is easily findable by future developers or
bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE
and HYSTART++ without having to dance through hoops for other CC modules, instead
both newreno and the other modules just call into the common functions if they desire
that behavior or roll there own if that makes more sense.

Note: This commit changes the kernel configuration!! If you are not using GENERIC in
some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC,
CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined
as well if you desire. Note that if you create a kernel configuration that does not
define a congestion control module and includes INET or INET6 the kernel compile will
break. Also you need to define a default, generic adds 'options CC_DEFAULT=\"newreno\"
but you can specify any string that represents the name of the CC module (same names
that show up in the CC module list under net.inet.tcp.cc). If you fail to add the
options CC_DEFAULT in your kernel configuration the kernel build will also break.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
RELNOTES:YES
Differential Revision: https://reviews.freebsd.org/D32693


# f581a26e 25-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP.

Pass control for IP/IP6 level options from generic tcp_ctloutput_set()
down to per-stack ctloutput.

Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput().

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655


# de156263 25-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Several IP level socket options may affect TCP.

After handling them in IP level ctloutput, pass them down to TCP
ctloutput.

We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place
for now, but comment out how it should be handled.

For IPv4 we are interested in IP_TOS and IP_TTL.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655


# fc4d53cc 25-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Split tcp_ctloutput() into set/get parts.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655


# e2833083 25-Oct-2021 Peter Lei <peterlei@netflix.com>

tcp: socket option to get stack alias name

TCP stack sysctl nodes are currently inserted using the stack
name alias. Allow the user to get the current stack's alias to
allow for programatic sysctl access.

Obtained from: Netflix


# bf256782 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers

ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so
errors are effectively hidden. Fix this by using a separate output
parameter. Convert to the new socket buffer locking macros while here.

Note that the socket buffer lock is not needed to synchronize the
SOLISTENING check here, we can rely on the PCB lock.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31977


# bd4a39cc 07-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Properly interlock when transitioning to a listening socket

Currently, most protocols implement pru_listen with something like the
following:

SOCK_LOCK(so);
error = solisten_proto_check(so);
if (error) {
SOCK_UNLOCK(so);
return (error);
}
solisten_proto(so);
SOCK_UNLOCK(so);

solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.

The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called. Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).

This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.

Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31659


# 3f1f6b6e 05-Aug-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp, udp: improve input validation in handling bind()

Reported by: syzbot+24fcfd8057e9bc339295@syzkaller.appspotmail.com
Reported by: syzbot+6e90ceb5c89285b2655b@syzkaller.appspotmail.com
Reviewed by: markj, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31422


# 4747500d 04-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: A better fix for the previously attempted fix of the ack-war issue with tcp.

So it turns out that my fix before was not correct. It ended with us failing
some of the "improved" SYN tests, since we are not in the correct states.
With more digging I have figured out the root of the problem is that when
we receive a SYN|FIN the reassembly code made it so we create a segq entry
to hold the FIN. In the established state where we were not in order this
would be correct i.e. a 0 len with a FIN would need to be accepted. But
if you are in a front state we need to strip the FIN so we correctly handle
the ACK but ignore the FIN. This gets us into the proper states
and avoids the previous ack war.

I back out some of the previous changes but then add a new change
here in tcp_reass() that fixes the root cause of the issue. We still
leave the rack panic fixes in place however.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30627


# f96603b5 31-May-2021 Mark Johnston <markj@FreeBSD.org>

tcp, udp: Permit binding with AF_UNSPEC if the address is INADDR_ANY

Prior to commit f161d294b we only checked the sockaddr length, but now
we verify the address family as well. This breaks at least ttcp. Relax
the check to avoid breaking compatibility too much: permit AF_UNSPEC if
the address is INADDR_ANY.

Fixes: f161d294b
Reported by: Bakul Shah <bakul@iitbombay.org>
Reviewed by: tuexen
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30539


# 086a3556 25-May-2021 Andrew Gallatin <gallatin@FreeBSD.org>

tcp: enter network epoch when calling tfb_tcp_fb_fini

We need to enter the network epoch when calling into
tfb_tcp_fb_fini. I noticed this when I hit an assert
running the latest rack

Differential Revision: https://reviews.freebsd.org/D30407
Reviewed by: rrs, tuexen
Sponsored by: Netflix


# 13c0e198 25-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Fix bugs related to the PUSH bit and rack and an ack war

Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed
in the last commit. The problem is the left edge gets transmitted before the adjustments are done
to the send_map, this means that right edge bits must be considered to be added only if
the entire RSM is being retransmitted.

Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns
out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself.
After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically
what happens is we go into the reassembly code and lose the FIN bit. The trick here is we
should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything.
That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no
timers running. This is because the usrclosed function gets called and the FIN's and such have
already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get
stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly
recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE
before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2
timer can speed this up in testing.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30451


# 7d2608a5 21-May-2021 Mark Johnston <markj@FreeBSD.org>

tcp: Make error handling in tcp_usr_send() more consistent

- Free the input mbuf in a single place instead of in every error path.
- Handle PRUS_NOTREADY consistently.
- Flush the socket's send buffer if an implicit connect fails. At that
point the mbuf has already been enqueued but we don't want to keep it
in the send buffer.

Reviewed by: gallatin, tuexen
Discussed with: jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30349


# d8acd268 12-May-2021 Mark Johnston <markj@FreeBSD.org>

Fix mbuf leaks in various pru_send implementations

The various protocol implementations are not very consistent about
freeing mbufs in error paths. In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.

This diff plugs various leaks in the pru_send implementations.

Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30151


# 0471a8c7 10-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: SACK Lost Retransmission Detection (LRD)

Recover from excessive losses without reverting to a
retransmission timeout (RTO). Disabled by default, enable
with sysctl net.inet.tcp.do_lrd=1

Reviewed By: #transport, rrs, tuexen, #manpages
Sponsored by: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D28931


# f161d294 02-May-2021 Mark Johnston <markj@FreeBSD.org>

Add missing sockaddr length and family validation to various protocols

Several protocol methods take a sockaddr as input. In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur. Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.

Reported by: syzkaller+KASAN
Reviewed by: tuexen, melifaro
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29519


# 9e644c23 18-Apr-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add support for TCP over UDP

Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469


# 4c0bef07 20-Jan-2021 Kyle Evans <kevans@FreeBSD.org>

kern: net: remove TCP_LINGERTIME

TCP_LINGERTIME can be traced back to BSD 4.4 Lite and perhaps beyond, in
exactly the same form that it appears here modulo slightly different
context. It used to be the case that there was a single pr_usrreq
method with requests dispatched to it; these exact two lines appeared in
tcp_usrreq's PRU_ATTACH handling.

The only purpose of this that I can find is to cause surprising behavior
on accepted connections. Newly-created sockets will never hit these
paths as one cannot set SO_LINGER prior to socket(2). If SO_LINGER is
set on a listening socket and inherited, one would expect the timeout to
be inherited rather than changed arbitrarily like this -- noting that
SO_LINGER is nonsense on a listening socket beyond inheritance, since
they cannot be 'connected' by definition.

Neither Illumos nor Linux reset the timer like this based on testing and
inspection of Illumos, and testing of Linux.

Reviewed by: rscheff, tuexen
Differential Revision: https://reviews.freebsd.org/D28265


# a034518a 19-Dec-2020 Andrew Gallatin <gallatin@FreeBSD.org>

Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain

In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.

This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.

This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.

Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# f903a308 16-Jul-2020 Michael Tuexen <tuexen@FreeBSD.org>

(Re)-allow 0.0.0.0 to be used as an address in connect() for TCP
In r361752 an error handling was introduced for using 0.0.0.0 or
255.255.255.255 as the address in connect() for TCP, since both
addresses can't be used. However, the stack maps 0.0.0.0 implicitly
to a local address and at least two regressions were reported.
Therefore, re-allow the usage of 0.0.0.0.
While there, change the error indicated when using 255.255.255.255
from EAFNOSUPPORT to EACCES as mentioned in the man-page of connect().

Reviewed by: rrs
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D25401


# e854dd38 08-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

An important statistic in determining if a server process (or client) is being delayed
is to know the time to first byte in and time to first byte out. Currently we
have no way to know these all we have is t_starttime. That (t_starttime) tells us
what time the 3 way handshake completed. We don't know when the first
request came in or how quickly we responded. Nor from a client perspective
do we know how long from when we sent out the first byte before the
server responded.

This small change adds the ability to track the TTFB's. This will show up in
BB logging which then can be pulled for later analysis. Note that currently
the tracking is via the ticks variable of all three variables. This provides
a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be
to change all three of these values into something with a much finer resolution
(either microseconds or nanoseconds), though we may want to make the resolution
configurable so that on lower powered machines we could still use the much
cheaper ticks variable.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24902


# 2cf21ae5 03-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

We should never allow either the broadcast or IN_ADDR_ANY to be
connected to or sent to. This was fond when working with Michael
Tuexen and Skyzaller. Skyzaller seems to want to use either of
these two addresses to connect to at times. And it really is
an error to do so, so lets not allow that behavior.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24852


# d442a657 03-Jun-2020 Michael Tuexen <tuexen@FreeBSD.org>

Restrict enabling TCP-FASTOPEN to end-points in CLOSED or LISTEN state

Enabling TCP-FASTOPEN on an end-point which is in a state other than
CLOSED or LISTEN, is a bug in the application. So it should not work.
Also the TCP code does not (and needs not to) handle this.
While there, also simplify the setting of the TF_FASTOPEN flag.

This issue was found by running syzkaller.

Reviewed by: rrs
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D25115


# 25102351 18-May-2020 Mike Karels <karels@FreeBSD.org>

Allow TCP to reuse local port with different destinations

Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.

Reviewed by: tuexen (rscheff, rrs previous version)
MFC after: 1 month
Sponsored by: Forcepoint LLC
Differential Revision: https://reviews.freebsd.org/D24781


# e240ce42 15-May-2020 Michael Tuexen <tuexen@FreeBSD.org>

Allow only IPv4 addresses in sendto() for TCP on AF_INET sockets.

This problem was found by looking at syzkaller reproducers for some other
problems.

Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24831


# 9d176904 10-May-2020 Michael Tuexen <tuexen@FreeBSD.org>

Remove trailing whitespace.


# d3b6c96b 04-May-2020 Randall Stewart <rrs@FreeBSD.org>

Adjust the fb to have a way to ask the underlying stack
if it can support the PRUS option (OOB). And then have
the new function call that to validate and give the
correct error response if needed to the user (rack
and bbr do not support obsoleted OOB data).

Sponsoered by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24574


# f1f93475 27-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Initial support for kernel offload of TLS receive.

- Add a new TCP_RXTLS_ENABLE socket option to set the encryption and
authentication algorithms and keys as well as the initial sequence
number.

- When reading from a socket using KTLS receive, applications must use
recvmsg(). Each successful call to recvmsg() will return a single
TLS record. A new TCP control message, TLS_GET_RECORD, will contain
the TLS record header of the decrypted record. The regular message
buffer passed to recvmsg() will receive the decrypted payload. This
is similar to the interface used by Linux's KTLS RX except that
Linux does not return the full TLS header in the control message.

- Add plumbing to the TOE KTLS interface to request either transmit
or receive KTLS sessions.

- When a socket is using receive KTLS, redirect reads from
soreceive_stream() into soreceive_generic().

- Note that this interface is currently only defined for TLS 1.1 and
1.2, though I believe we will be able to reuse the same interface
and structures for 1.3.


# ec1db6e1 27-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Add the initial sequence number to the TLS enable socket option.

This will be needed for KTLS RX.

Reviewed by: gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24451


# a3574665 13-Feb-2020 Michael Tuexen <tuexen@FreeBSD.org>

sack_newdata and snd_recover hold the same value. Therefore, use only
a single instance: use snd_recover also where sack_newdata was used.

Submitted by: Richard Scheffenegger
Differential Revision: https://reviews.freebsd.org/D18811


# 481be5de 12-Feb-2020 Randall Stewart <rrs@FreeBSD.org>

White space cleanup -- remove trailing tab's or spaces
from any line.

Sponsored by: Netflix Inc.


# 42ce7937 29-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Fix missing NET_EPOCH_ENTER() when compiled with TCP_OFFLOAD.

Reported by: Coverity
CID: 1413162


# 7754e281 22-Jan-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix NOINET kernels after r356983.

All gotos to the label are within the #ifdef INET section, which leaves
us with an unused label. Cover the label under #ifdef INET as well to
avoid the warning and compile time error.


# c1604fe4 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make in_pcbladdr() require network epoch entered by its callers. Together
with this widen network epoch coverage up to tcp_connect() and udp_connect().

Revisions from r356974 and up to this revision cover D23187.

Differential Revision: https://reviews.freebsd.org/D23187


# e2636f0a 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Remove extraneous NET_EPOCH_ASSERT - the full function is covered.


# 3fed74e9 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Re-absorb tcp_detach() back into tcp_usr_detach() as the comment suggests.
Not a functional change.


# 5fc8df3c 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Don't enter network epoch in tcp_usr_detach. A PCB removal doesn't
require that.


# 7669c586 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_usr_attach() doesn't need network epoch. in_pcbfree() and
in_pcbdetach() perform all necessary synchronization themselves.


# 0f6385e7 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Inline tcp_attach() into tcp_usr_attach(). Not a functional change.


# 109eb549 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make tcp_output() require network epoch.

Enter the epoch before calling into tcp_output() from those
functions, that didn't do that before.

This eliminates a bunch of epoch recursions in TCP.


# adc56f5a 02-Dec-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make use of the stats(3) framework in the TCP stack.

This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time. stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by: thj
Tested by: thj
Obtained from: Netflix
Relnotes: yes
Sponsored by: Klara Inc, Netflix
Differential Revision: https://reviews.freebsd.org/D20655


# 3cf38784 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

Move all ECN related flags from the flags to the flags2 field.
This allows adding more ECN related flags in the future.
No functional change intended.

Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22497


# 97a95ee1 06-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in
TCP functions that are executed in syscall context. No
functional change here.


# 4a91aa8f 24-Oct-2019 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that the flags indicating IPv4/IPv6 are not changed by failing
bind() calls. This would lead to inconsistent state resulting in a panic.
A fix for stable/11 was committed in
https://svnweb.freebsd.org/base?view=revision&revision=338986
An accelerated MFC is planned as discussed with emaste@.

Reported by: syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com
Obtained from: jtl@
MFC after: 1 day
Sponsored by: Netflix, Inc.


# 9e14430d 08-Oct-2019 John Baldwin <jhb@FreeBSD.org>

Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.

This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler. This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE. Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by: np, gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21891


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# 0ecd976e 02-Aug-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

IPv6 cleanup: kernel

Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names. We are down to very few places now that it is feasible
to do the change for everything remaining with causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
- in6pcb which really is an inpcb and nothing separate
- sotoin6pcb which is sotoinpcb (as per above)
- in6p_sp which is inp_sp
- in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone also rename the in6p variables to inp as
that is what we call it in most of the network stack including
parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will less ignore one or the other protocol family when doing
code changes as they may not have spotted places due to different
names for the same thing.

No functional changes.

Discussed with: tuexen (SCTP changes)
MFC after: 3 months
Sponsored by: Netflix


# 82334850 28-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Add an external mbuf buffer type that holds multiple unmapped pages.

Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616


# 68cea2b1 18-Apr-2019 John Baldwin <jhb@FreeBSD.org>

Push down INP_WLOCK slightly in tcp_ctloutput.

The inp lock is not needed for testing the V6 flag as that flag is set
once when the inp is created and never changes. For non-TCP socket
options the lock is immediately dropped after checking that flag.
This just pushes the lock down to only be acquired for TCP socket
options.

This isn't a hot-path, more a cosmetic cleanup I noticed while reading
the code.

Reviewed by: bz
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19740


# c8b53ced 30-Nov-2018 Michael Tuexen <tuexen@FreeBSD.org>

Limit option_len for the TCP_CCALGOOPT.

Limiting the length to 2048 bytes seems to be acceptable, since
the values used right now are using 8 bytes.

Reviewed by: glebius, bz, rrs
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18366


# c6c0be27 24-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Fix a shadowed variable warning.
Thanks to Peter Lei for reporting the issue.

Approved by: re(kib@)
MFH: 1 month
Sponsored by: Netflix, Inc.


# 5dff1c38 21-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
socket resulted in sending fragmented IPV6 packets.

This is fixes by reducing the MSS to the appropriate value. In addtion,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not stricly required, but done since TCP
is conservative.

PR: 173444
Reviewed by: bz@, rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16796


# c28440db 19-Aug-2018 Randall Stewart <rrs@FreeBSD.org>

This change represents a substantial restructure of the way we
reassembly inbound tcp segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments which in a high BDP situation will cause us to be a
lot more in-efficent as we drop segments beyond 100 entries that
we receive. What this restructure does is cause the reassembly
buffer to coalesce segments putting an emphasis on the two
common cases (which avoid walking the list of segments) i.e.
where we add to the back of the queue of segments and where we
add to the front. We also have the reassembly buffer supporting
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16626


# 8e02b4e0 19-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Don't expose the uptime via the TCP timestamps.

The TCP client side or the TCP server side when not using SYN-cookies
used the uptime as the TCP timestamp value. This patch uses in all
cases an offset, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same a used for the initial TSN.

Reviewed by: rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16636


# 51e08d53 31-Jul-2018 Michael Tuexen <tuexen@FreeBSD.org>

Fix INET only builds.

r336940 introduced an "unused variable" warning on platforms which
support INET, but not INET6, like MALTA and MALTA64 as reported
by Mark Millard. Improve the #ifdefs to address this issue.

Sponsored by: Netflix, Inc.


# 888973f5 30-Jul-2018 Michael Tuexen <tuexen@FreeBSD.org>

Allow implicit TCP connection setup for TCP/IPv6.

TCP/IPv4 allows an implicit connection setup using sendto(), which
is used for TTCP and TCP fast open. This patch adds support for
TCP/IPv6.
While there, improve some tests for detecting multicast addresses,
which are mapped.

Reviewed by: bz@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16458


# 8db239dc 30-Jul-2018 Michael Tuexen <tuexen@FreeBSD.org>

Fix some TCP fast open issues.

The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled, calls accept(),
recv(), send(), and close() before the TCP-ACK segment has been received,
the TCP connection is just dropped and the reception of the TCP-ACK
segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled, calls accept(), recv(),
send(), send(), and close() before the TCP-ACK segment has been received,
the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
by close() the TCP connection is just dropped.

Reviewed by: jtl@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16485


# 22699887 21-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

NULL out cc_data in pluggable TCP {cc}_cb_destroy

When ABE was added (rS331214) to NewReno and leak fixed (rS333699) , it now has
a destructor (newreno_cb_destroy) for per connection state. Other congestion
controls may allocate and free cc_data on entry and exit, but the field is
never explicitly NULLed if moving back to NewReno which only internally
allocates stateful data (no entry contstructor) resulting in a situation where
newreno_cb_destory might be called on a junk pointer.

- NULL out cc_data in the framework after calling {cc}_cb_destroy
- free(9) checks for NULL so there is no need to perform not NULL checks
before calling free.
- Improve a comment about NewReno in tcp_ccalgounload

This is the result of a debugging session from Jason Wolfe, Jason Eggleston,
and mmacy@ and very helpful insight from lstewart@.

Submitted by: Kevin Bowling
Reviewed by: lstewart
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16282


# 6573d758 03-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


# fd389e7c 19-Apr-2018 Randall Stewart <rrs@FreeBSD.org>

These two modules need the tcp_hpts.h file for
when the option is enabled (not sure how LINT/build-universe
missed this) opps.

Sponsored by: Netflix Inc


# 3ee9c3c4 19-Apr-2018 Randall Stewart <rrs@FreeBSD.org>

This commit brings in the TCP high precision timer system (tcp_hpts).
It is the forerunner/foundational work of bringing in both Rack and BBR
which use hpts for pacing out packets. The feature is optional and requires
the TCPHPTS option to be enabled before the feature will be active. TCP
modules that use it must assure that the base component is compile in
the kernel in which they are loaded.

MFC after: Never
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15020


# 8fa799bd 06-Apr-2018 Jonathan T. Looney <jtl@FreeBSD.org>

If a user closes the socket before we call tcp_usr_abort(), then
tcp_drop() may unlock the INP. Currently, tcp_usr_abort() does not
check for this case, which results in a panic while trying to unlock
the already-unlocked INP (not to mention, a use-after-free violation).

Make tcp_usr_abort() check the return value of tcp_drop(). In the case
where tcp_drop() returns NULL, tcp_usr_abort() can skip further steps
to abort the connection and simply unlock the INP_INFO lock prior to
returning.

Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Netflix, Inc.


# c73b6f4d 04-Apr-2018 Ed Maste <emaste@FreeBSD.org>

Fix kernel memory disclosure in tcp_ctloutput

strcpy was used to copy a string into a buffer copied to userland, which
left uninitialized data after the terminating 0-byte. Use the same
approach as in tcp_subr.c: strncpy and explicit '\0'.

admbugs: 765, 822
MFC after: 1 day
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reported by: Vlad Tsyrklevich
Security: Kernel memory disclosure
Sponsored by: The FreeBSD Foundation


# a6456410 02-Apr-2018 Navdeep Parhar <np@FreeBSD.org>

Add a hook to allow the toedev handling an offloaded connection to
provide accurate TCP_INFO.

Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14816


# e24e5683 23-Mar-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Make the TCP blackbox code committed in r331347 be an optional feature
controlled by the TCP_BLACKBOX option.

Enable this as part of amd64 GENERIC. For now, leave it disabled on
other platforms.

Sponsored by: Netflix, Inc.


# 2529f56e 22-Mar-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Add the "TCP Blackbox Recorder" which we discussed at the developer
summits at BSDCan and BSDCam in 2017.

The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.

It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.

You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.

This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.

There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.

Reviewed by: gnn (previous version)
Obtained from: Netflix, Inc.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11085


# dd388cfd 21-Mar-2018 Gleb Smirnoff <glebius@FreeBSD.org>

The net.inet.tcp.nolocaltimewait=1 optimization prevents local TCP connections
from entering the TIME_WAIT state. However, it omits sending the ACK for the
FIN, which results in RST. This becomes a bigger deal if the sysctl
net.inet.tcp.blackhole is 2. In this case RST isn't send, so the other side of
the connection (also local) keeps retransmitting FINs.

To fix that in tcp_twstart() we will not call tcp_close() immediately. Instead
we will allocate a tcptw on stack and proceed to the end of the function all
the way to tcp_twrespond(), to generate the correct ACK, then we will drop the
last PCB reference.

While here, make a few tiny improvements:
- use bools for boolean variable
- staticize nolocaltimewait
- remove pointless acquisiton of socket lock

Reported by: jtl
Reviewed by: jtl
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14697


# 18a75309 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.

The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by: hiren
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14048


# c560df6f 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

This is an implementation of the client side of TCP Fast Open (TFO)
[RFC7413]. It also includes a pre-shared key mode of operation in
which the server requires the client to be in possession of a shared
secret in order to successfully open TFO connections with that server.

The names of some existing fastopen sysctls have changed (e.g.,
net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable).

Reviewed by: tuexen
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14047


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 63ec505a 18-Aug-2017 Michael Tuexen <tuexen@FreeBSD.org>

Ensure inp_vflag is consistently set for TCP endpoints.

Make sure that the flags INP_IPV4 and INP_IPV6 are consistently set
for inpcbs used for TCP sockets, no matter if the setting is derived
from the net.inet6.ip6.v6only sysctl or the IPV6_V6ONLY socket option.
For UDP this was already done right.

PR: 221385
MFC after: 1 week


# 5dba6ada 22-May-2017 Michael Tuexen <tuexen@FreeBSD.org>

The connect() system call should return -1 and set errno to EAFNOSUPPORT
if it is called on a TCP socket
* with an IPv6 address and the socket is bound to an
IPv4-mapped IPv6 address.
* with an IPv4-mapped IPv6 address and the socket is bound to an
IPv6 address.
Thanks to Jonathan T. Leighton for reporting this issue.

Reviewed by: bz gnn
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D9163


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 4616026f 09-Feb-2017 Ermal Luçi <eri@FreeBSD.org>

Revert r313527

Heh svn is not git


# c0fadfdb 09-Feb-2017 Ermal Luçi <eri@FreeBSD.org>

Correct missed variable name.

Reported-by: ohartmann@walstatt.org


# fcf59617 06-Feb-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Merge projects/ipsec into head/.

Small summary
-------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352


# f5cf1e5f 18-Oct-2016 Julien Charbon <jch@FreeBSD.org>

Fix a double-free when an inp transitions to INP_TIMEWAIT state
after having been dropped.

This fixes enforces in_pcbdrop() logic in tcp_input():

"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."

PR: 203175
Reported by: Palle Girgensohn, Slawa Olhovchenkov
Tested by: Slawa Olhovchenkov
Reviewed by: Slawa Olhovchenkov
Approved by: gnn, Slawa Olhovchenkov
Differential Revision: https://reviews.freebsd.org/D8211
MFC after: 1 week
Sponsored by: Verisign, inc


# 68bd7ed1 12-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

The TFO server-side code contains some changes that are not conditioned on
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.

While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.

Reviewed by: hiren
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8219


# 3ac12506 06-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by: gnn, lstewart (partial)
Sponsored by: Juniper Networks, Netflix
Differential Revision: (multiple)
Tested by: Limelight, Netflix


# 5a17b6ad 14-Sep-2016 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that the IPPROTO_TCP level socket options
* TCP_KEEPINIT
* TCP_KEEPINTVL
* TCP_KEEPIDLE
* TCP_KEEPCNT
always always report the values currently used when getsockopt()
is used. This wasn't the case when the sysctl-inherited default
values where used.
Ensure that the IPPROTO_TCP level socket option TCP_INFO has the
TCPI_OPT_ECN flag set in the tcpi_options field when ECN support
has been negotiated successfully.

Reviewed by: rrs, jtl, hiren
MFC after: 1 month
Differential Revision: 7833


# 587d67c0 16-Aug-2016 Randall Stewart <rrs@FreeBSD.org>

Here we update the modular tcp to be able to switch to an
alternate TCP stack in other then the closed state (pre-listen/connect).
The idea is that *if* that is supported by the alternate stack, it
is asked if its ok to switch. If it approves the "handoff" then we
allow the switch to happen. Also the fini() function now gets a flag
to tell if you are switching away *or* the tcb is destroyed. The
init() call into the alternate stack is moved to the end so the
tcb is more fully formed before the init transpires.

Sponsored by: Netflix Inc.
Differential Revision: D6790


# bac5bedf 26-Apr-2016 Conrad Meyer <cem@FreeBSD.org>

tcp_usrreq: Free allocated buffer in relock case

The disgusting macro INP_WLOCK_RECHECK may early-return. In
tcp_default_ctloutput() the TCP_CCALGOOPT case allocates memory before invoking
this macro, which may leak memory.

Add a _CLEANUP variant that takes a code argument to perform variable cleanup
in the early return path. Use it to free the 'pbuf' allocated in
tcp_default_ctloutput().

I am not especially happy with this macro, but I reckon it's not any worse than
INP_WLOCK_RECHECK already was.

Reported by: Coverity
CID: 1350286
Sponsored by: EMC / Isilon Storage Division


# bf840a17 14-Mar-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r294869. The array of counters for TCP states doesn't belong to
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.

Place running connection counts into separate array, and provide
separate read-only sysctl oid for it.


# e79cb051 03-Mar-2016 George V. Neville-Neil <gnn@FreeBSD.org>

Fix dtrace probes (introduced in 287759): debug__input was used
for output and drop; connect didn't always fire a user probe
some probes were missing in fastpath

Submitted by: Hannes Mehnert
Sponsored by: REMS, EPSRC
Differential Revision: https://reviews.freebsd.org/D5525


# 4644fda3 27-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Rename netinet/tcp_cc.h to netinet/cc/cc.h.

Discussed with: lstewart


# af6fef3a 27-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Fix issues with TCP_CONGESTION handling after r294540:
o Return back the buf[TCP_CA_NAME_MAX] for TCP_CONGESTION,
for TCP_CCALGOOPT use dynamically allocated *pbuf.
o For SOPT_SET TCP_CONGESTION do NULL terminating of string
taking from userland.
o For SOPT_SET TCP_CONGESTION do the search for the algorithm
keeping the inpcb lock.
o For SOPT_GET TCP_CONGESTION first strlcpy() the name
holding the inpcb lock into temporary buffer, then copyout.

Together with: lstewart


# 57a78e3b 26-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Augment struct tcpstat with tcps_states[], which is used for book-keeping
the amount of TCP connections by state. Provides a cheap way to get
connection count without traversing the whole pcb list.

Sponsored by: Netflix


# d519cedb 21-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Provide new socket option TCP_CCALGOOPT, which stands for TCP congestion
control algorithm options. The argument is variable length and is opaque
to TCP, forwarded directly to the algorithm's ctl_output method.

Provide new includes directory netinet/cc, where algorithm specific
headers can be installed.

The new API doesn't yet have any in tree consumers.

The original code written by lstewart.
Reviewed by: rrs, emax
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D711


# 73e263b1 21-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Refactor TCP_CONGESTION setsockopt handling:
- Use M_TEMP instead of stack variable.
- Unroll error handling, removing several levels of indentation.


# 2de3e790 21-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

- Rename cc.h to more meaningful tcp_cc.h.
- Declare it a kernel only include, which it already is.
- Don't include tcp.h implicitly from tcp_cc.h


# 0c39d38d 06-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Historically we have two fields in tcpcb to describe sender MSS: t_maxopd,
and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned
up after T/TCP removal. After all permutations over the years the result is
that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum
protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if
timestamps are in action, or is equal to t_maxopd otherwise. That's a very
rough estimate of MSS reduced by options length. Throughout the code it
was used in places, where preciseness was not important, like cwnd or
ssthresh calculations.

With this change:

- t_maxopd goes away.
- t_maxseg now stores MSS not adjusted by options.
- new function tcp_maxseg() is provided, that calculates MSS reduced by
options length. The functions gives a better estimate, since it takes
into account SACK state as well.

Reviewed by: jtl
Differential Revision: https://reviews.freebsd.org/D3593


# 281a0fd4 24-Dec-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Implementation of server-side TCP Fast Open (TFO) [RFC7413].

TFO is disabled by default in the kernel build. See the top comment
in sys/netinet/tcp_fastopen.c for implementation particulars.

Reviewed by: gnn, jch, stas
MFC after: 3 days
Sponsored by: Verisign, Inc.
Differential Revision: https://reviews.freebsd.org/D4350


# 55bceb1e 15-Dec-2015 Randall Stewart <rrs@FreeBSD.org>

First cut of the modularization of our TCP stack. Still
to do is to clean up the timer handling using the async-drain.
Other optimizations may be coming to go with this. Whats here
will allow differnet tcp implementations (one included).
Reviewed by: jtl, hiren, transports
Sponsored by: Netflix Inc.
Differential Revision: D4055


# 86a996e6 13-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren


# 550e9d42 15-Sep-2015 Hiren Panchasara <hiren@FreeBSD.org>

Remove unnecessary tcp state transition call.

Differential Revision: D3451
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Limelight Networks


# 5d06879a 13-Sep-2015 George V. Neville-Neil <gnn@FreeBSD.org>

dd DTrace probe points, translators and a corresponding script
to provide the TCPDEBUG functionality with pure DTrace.

Reviewed by: rwatson
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: D3530


# 079672cb 08-Aug-2015 Julien Charbon <jch@FreeBSD.org>

Fix a kernel assertion issue introduced with r286227:
Avoid too strict INP_INFO_RLOCK_ASSERT checks due to
tcp_notify() being called from in6_pcbnotify().

Reported by: Larry Rosenman <ler@lerctr.org>
Submitted by: markj, jch


# ff9b006d 02-Aug-2015 Julien Charbon <jch@FreeBSD.org>

Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

- The existing TCP INP_INFO lock continues to protect the global inpcb list
stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
and free) and inpcb global counters.

It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR: 183659
Differential Revision: https://reviews.freebsd.org/D2599
Reviewed by: jhb, adrian
Tested by: adrian, nitroboost-gmail.com
Sponsored by: Verisign, Inc.


# 4741bfcb 29-Jul-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Revert r265338, r271089 and r271123 as those changes do not handle
non-inline urgent data and introduce an mbuf exhaustion attack vector
similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs.

Address the issue described in FreeBSD-SA-15:15.tcp.

Reviewed by: glebius
Approved by: so
Approved by: jmallett (mentor)
Security: FreeBSD-SA-15:15.tcp
Sponsored by: Norse Corp, Inc.


# eb96dc33 09-Mar-2015 Julien Charbon <jch@FreeBSD.org>

In TCP, connect() can return incorrect error code EINVAL
instead of EADDRINUSE or ECONNREFUSED

PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=196035
Differential Revision: https://reviews.freebsd.org/D1982
Reported by: Mark Nunberg <mnunberg@haskalah.org>
Submitted by: Harrison Grundy <harrison.grundy@astrodoggroup.com>
Reviewed by: adrian, jch, glebius, gnn
Approved by: jhb
MFC after: 2 weeks


# 2cbcd3c1 30-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/sendfile:

- Provide pru_ready function for TCP.
- Don't call tcp_output() from tcp_usr_send() if no ready data was put
into the socket buffer.
- In case of dropped connection don't try to m_freem() not ready data.

Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 651e4e6a 30-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 300fa232 29-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Missed in r274421: use sbavail() instead of bare access to sb_cc.


# cea40c48 30-Oct-2014 Julien Charbon <jch@FreeBSD.org>

Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and
tcp_tw_2msl_scan(). This race condition drives unplanned timewait
timeout cancellation. Also simplify implementation by holding inpcb
reference and removing tcptw reference counting.

Differential Revision: https://reviews.freebsd.org/D826
Submitted by: Marc De la Gueronniere <mdelagueronniere@verisign.com>
Submitted by: jch
Reviewed By: jhb (mentor), adrian, rwatson
Sponsored by: Verisign, Inc.
MFC after: 2 weeks
X-MFC-With: r264321


# 489dcc92 12-Oct-2014 Julien Charbon <jch@FreeBSD.org>

A connection in TIME_WAIT state before calling close() actually did not
received any RST packet. Do not set error to ECONNRESET in this case.

Differential Revision: https://reviews.freebsd.org/D879
Reviewed by: rpaulo, adrian
Approved by: jhb (mentor)
Sponsored by: Verisign, Inc.


# a7e201bb 10-Sep-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Make in6_pcblookup_hash_locked and in6_pcbladdr static.

Obtained from: Yandex LLC
Sponsored by: Yandex LLC


# e407b67b 04-May-2014 Gleb Smirnoff <glebius@FreeBSD.org>

The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with
mixing on stack memory and UMA memory in one linked list.

Thus, rewrite TCP reassembly code in terms of memory usage. The
algorithm remains unchanged.

We actually do not need extra memory to build a reassembly queue.
Arriving mbufs are always packet header mbufs. So we got the length
of data as pkthdr.len. We got m_nextpkt for linkage. And we need
only one pointer to point at the tcphdr, use PH_loc for that.

In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen
field now counts not packets, but bytes in the queue. This gives
us more precision when comparing to socket buffer limits.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 6f3caa6d 28-Jan-2014 George V. Neville-Neil <gnn@FreeBSD.org>

Decrease lock contention within the TCP accept case by removing
the INP_INFO lock from tcp_usr_accept. As the PR/patch states
this was following the advice already in the code.
See the PR below for a full disucssion of this change and its
measured effects.

PR: 183659
Submitted by: Julian Charbon
Reviewed by: jhb


# 2f3eb7f4 08-Nov-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Make TCP_KEEP* socket options readable. At least PostgreSQL wants
to read the values.

Reported by: sobomax


# 76039bc8 26-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 57f60867 25-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


# adfaf8f6 25-Jan-2013 Navdeep Parhar <np@FreeBSD.org>

Add checks for SO_NO_OFFLOAD in a couple of places that I missed earlier
in r245915.


# 460cf046 25-Jan-2013 Navdeep Parhar <np@FreeBSD.org>

There is no need to call into the TOE driver twice in pru_rcvd (tod_rcvd
and then tod_output right after that).

Reviewed by: bz@


# 37cc0ecb 25-Jan-2013 Navdeep Parhar <np@FreeBSD.org>

Heed SO_NO_OFFLOAD.

MFC after: 1 week


# 85c05144 27-Sep-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix bug in TCP_KEEPCNT setting, which slipped in in the last round
of reviewing of r231025.

Unlike other options from this family TCP_KEEPCNT doesn't specify
time interval, but a count, thus parameter supplied doesn't need
to be multiplied by hz.

Reported & tested by: amdmi3


# 09fe6320 19-Jun-2012 Navdeep Parhar <np@FreeBSD.org>

- Updated TOE support in the kernel.

- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)


# 9077f387 05-Feb-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT, that allow to control initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.

Reviewed by: andre, bz, lstewart


# db3cee51 06-Jan-2012 Navdeep Parhar <np@FreeBSD.org>

Always release the inp lock before returning from tcp_detach.

MFC after: 5 days


# 873789cb 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

Move the tcp_sendspace and tcp_recvspace sysctl's from
the middle of tcp_usrreq.c to the top of tcp_output.c
and tcp_input.c respectively next to the socket buffer
autosizing controls.

MFC after: 1 week


# e233e2ac 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

VNET virtualize tcp_sendspace/tcp_recvspace and change the
type to INT. A long is not necessary as the TCP window is
limited to 2**30. A larger initial window isn't useful.

MFC after: 1 week


# c8360ae2 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

Update the comment and description of tcp_sendspace and tcp_recvspace
to better reflect their purpose.
MFC after: 1 week


# b598155a 02-Jun-2011 Robert Watson <rwatson@FreeBSD.org>

Do not leak the pcbinfohash lock in the case where in6_pcbladdr() returns
an error during TCP connect(2) on an IPv6 socket.

Submitted by: bz
Sponsored by: Juniper Networks, Inc.


# fa046d87 30-May-2011 Robert Watson <rwatson@FreeBSD.org>

Decompose the current single inpcbinfo lock into two locks:

- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


# b287c6c7 30-Apr-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness. To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


# d28b9e89 04-Feb-2011 John Baldwin <jhb@FreeBSD.org>

When turning off TCP_NOPUSH, only call tcp_output() to immediately flush
any pending data if the connection is established.

Submitted by: csjp
Reviewed by: lstewart
MFC after: 1 week


# 7f79e7e4 29-Jan-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove duplicate printing of TF_NOPUSH in db_print_tflags().

MFC after: 10 days


# 79e955ed 07-Jan-2011 John Baldwin <jhb@FreeBSD.org>

Trim extra spaces before tabs.


# f5d34df5 17-Nov-2010 George V. Neville-Neil <gnn@FreeBSD.org>

Add new, per connection, statistics for TCP, including:
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives

These statistics are available via the -T argument to
netstat(1).
MFC after: 2 weeks


# dbc42409 11-Nov-2010 Lawrence Stewart <lstewart@FreeBSD.org>

This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
algorithms to be used in the net stack. Algorithms can maintain per-connection
state if required, and connections maintain their own algorithm pointer, which
allows different connections to concurrently use different algorithms. The
TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
programmatically query or change the congestion control algorithm respectively
from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
possible. Care was also taken to develop the framework in a way that should
allow integration with other congestion aware transport protocols (e.g. SCTP)
in the future. The hope is that we will one day be able to share a single set
of congestion control algorithm modules between all congestion aware transport
protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
and use it to decouple the meaning of recovery from a congestion event and
recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
congestion control protocols don't generally need to recover from packet loss
and need a different way to note a congestion recovery episode within the
stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
and ensures the stack always uses the appropriate mechanisms for recovering
from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
massage it into module form. NewReno is always built into the kernel and will
remain the default algorithm for the forseeable future. Implementations of
additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: Cisco URP, FreeBSD Foundation
Reviewed by: rpaulo
Tested by: David Hayes (and many others over the years)
MFC after: 3 months


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 1c18314d 16-Sep-2010 Andre Oppermann <andre@FreeBSD.org>

Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework. It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI. It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI. It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned. The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.


# 03b868be 01-Jun-2010 Robert Watson <rwatson@FreeBSD.org>

Merge r204809 from head to stable/8:

Add a comment to tcp_usr_accept() to indicate why it is we acquire the
tcbinfo lock there: r175612, which re-added it, masked a race between
sonewconn(2) and accept(2) that could allow an incompletely initialized
address on a newly-created socket on a listen queue to be exposed. Full
details can be found in that commit message.

Sponsored by: Juniper Networks

Approved by: re (bz)


# 8296cddf 06-Mar-2010 Robert Watson <rwatson@FreeBSD.org>

Add a comment to tcp_usr_accept() to indicate why it is we acquire the
tcbinfo lock there: r175612, which re-added it, masked a race between
sonewconn(2) and accept(2) that could allow an incompletely initialized
address on a newly-created socket on a listen queue to be exposed. Full
details can be found in that commit message.

MFC after: 1 week
Sponsored by: Juniper Networks


# e10b0dfd 05-Jan-2010 John Baldwin <jhb@FreeBSD.org>

MFC 200847:
- Rename the __tcpi_(snd|rcv)_mss fields of the tcp_info structure to remove
the leading underscores since they are now implemented.
- Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info
structure.


# 43d94734 22-Dec-2009 John Baldwin <jhb@FreeBSD.org>

- Rename the __tcpi_(snd|rcv)_mss fields of the tcp_info structure to remove
the leading underscores since they are now implemented.
- Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info
structure.

Reviewed by: rwatson
MFC after: 2 weeks


# 11c99a6d 15-Sep-2009 Andre Oppermann <andre@FreeBSD.org>

-Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by: brooks

Once compiled in make it easily switchable for testers by using a tuneable
net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by: rwatson

MFC after: 2 days


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 88d166bf 23-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Make callers to in6_selectsrc() and in6_pcbladdr() pass in memory
to save the selected source address rather than returning an
unreferenced copy to a pointer that might long be gone by the
time we use the pointer for anything meaningful.

Asked for by: rwatson
Reviewed by: rwatson


# ef760e6a 22-Jun-2009 Andre Oppermann <andre@FreeBSD.org>

Add soreceive_stream(), an optimized version of soreceive() for
stream (TCP) sockets.

It is functionally identical to generic soreceive() but has a
number stream specific optimizations:
o does only one sockbuf unlock/lock per receive independent of
the length of data to be moved into the uio compared to
soreceive() which unlocks/locks per *mbuf*.
o uses m_mbuftouio() instead of its own copy(out) variant.
o much more compact code flow as a large number of special
cases is removed.
o much improved reability.

It offers significantly reduced CPU usage and lock contention
when receiving fast TCP streams. Additional gains are obtained
when the receiving application is using SO_RCVLOWAT to batch up
some data before a read (and wakeup) is done.

This function was written by "reverse engineering" and is not
just a stripped down variant of soreceive().

It is not yet enabled by default on TCP sockets. Instead it is
commented out in the protocol initialization in tcp_usrreq.c
until more widespread testing has been done.

Testers, especially with 10GigE gear, are welcome.

MFP4: r164817 //depot/user/andre/soreceive_stream/


# 9f78a87a 16-Jun-2009 John Baldwin <jhb@FreeBSD.org>

- Change members of tcpcb that cache values of ticks from int to u_int:
t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age,
t_badrxtwin.
- Change t_recent in struct timewait from u_long to u_int32_t to match
the type of the field it shadows from tcpcb: ts_recent.
- Change t_starttime in struct timewait from u_long to u_int to match
the t_starttime field in tcpcb.

Requested by: bde (1, 3)


# a13c655c 11-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Correct printf format type mismatches.


# 0e8cc7e7 10-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Change a few members of tcpcb that store cached copies of ticks to be ints
instead of unsigned longs. This fixes a few overflow edge cases on 64-bit
platforms. Specifically, if an idle connection receives a packet shortly
before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep
alive timer fires after 2^31 clock ticks, the keep alive timer will think
that the connection has been idle for a very long time and will immediately
drop the connection instead of sending a keep alive probe.

Reviewed by: silby, gnn, lstewart
MFC after: 1 week


# 78b50714 11-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after: 3 days


# 970caf60 07-Apr-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

With the right comparison we get a proper wscale value and thus
more adequate TCP performance with IPv6.

Changes for IPv4, r166403 and r172795, both ignored the
IPv6 counterpart and left it in the state of art of year 2000.

The same logic in syncache already shares code between v4 and v6 so
things do not need to be adapted there.

Reported by: Steinar Haug (sthaug nethelp.no)
Tested by: Steinar Haug (sthaug nethelp.no)
MFC after: 3 days


# ad71fe3c 15-Mar-2009 Robert Watson <rwatson@FreeBSD.org>

Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted. Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

INP_TIMEWAIT
INP_SOCKREF
INP_ONESBCAST
INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after: 1 week (or after dependencies are MFC'd)
Reviewed by: bz


# ce2ae9ab 24-Feb-2009 Robert Watson <rwatson@FreeBSD.org>

In tcp_usr_shutdown() and tcp_usr_send(), I missed converting NULL
checks for the tcpcb, previously used to detect complete disconnection,
with INP_DROPPED checks. Correct that, preventing shutdown() from
improperly generating a TCP segment with destination IP and port of
0.0.0.0:0.

PR: kern/132050
Reported by: david gueluy <david.gueluy at netasq.com>
MFC after: 3 weeks


# b89e82dd 05-Feb-2009 Jamie Gritton <jamie@FreeBSD.org>

Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise. The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family. For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes). Remove the explicit
jailed() checks that preceded many of the function calls.

Approved by: bz (mentor)


# dcdb4371 16-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by: rwatson during review [1]
Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks


# fc384fa5 15-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().

Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.

Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 413628a7 29-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# 5cd54324 27-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Replace most INP_CHECK_SOCKAF() uses checking if it is an
IPv6 socket by comparing a constant inp vflag.
This is expected to help to reduce extra locking.

Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks


# 6aee2fc5 26-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Merge in6_pcbfree() into in_pcbfree() which after the previous
IPsec change in r185366 only differed in two additonal IPv6 lines.
Rather than splattering conditional code everywhere add the v6
check centrally at this single place.

Reviewed by: rwatson (as part of a larger changset)
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.


# 0206cdb8 26-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove in6_pcbdetach() as it is exactly the same function
as in_pcbdetach() and we don't need the code twice.

Reviewed by: rwatson
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# f2512ba1 31-Jul-2008 Rui Paulo <rpaulo@FreeBSD.org>

MFp4 (//depot/projects/tcpecn/):

TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
TCP ECN is defined in RFC 3168.

Partly reviewed by: dwmalone, silby
Obtained from: NetBSD


# 8ab7ce7c 05-May-2008 Kip Macy <kmacy@FreeBSD.org>

replace spaces added in last change with tabs


# 535fbad6 05-May-2008 Kip Macy <kmacy@FreeBSD.org>

add rcv_nxt, snd_nxt, and toe offload id to FreeBSD-specific
extension fields for tcp_info


# 8501a69c 17-Apr-2008 Robert Watson <rwatson@FreeBSD.org>

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# 109058b0 23-Jan-2008 Robert Watson <rwatson@FreeBSD.org>

tcp_usrreq.c:1.313 removed tcbinfo locking from tcp_usr_accept(), which
while in principle a good idea, opened us up to a race inherrent to
the syncache's direct insertion of incoming TCP connections into the
"completed connection" listen queue, as it transpires that the socket
is inserted before the inpcb is fully filled in by syncache_expand().
The bug manifested with the occasional returning of 0.0.0.0:0 in the
address returned by the accept() system call, which occurred if accept
managed to execute tcp_usr_accept() before syncache_expand() had copied
the endpoint addresses into inpcb connection state.

Re-add tcbinfo locking around the address copyout, which has the effect
of delaying the copy until syncache_expand() has finished running, as
it is run while the tcbinfo lock is held. This is undesirable in that
it increases contention on tcbinfo further, but a more significant
change will be required to how the syncache inserts new sockets in
order to fix this and keep more granular locking here. In particular,
either more state needs to be passed into sonewconn() so that
pru_attach() can fill in the fields *before* the socket is inserted, or
the socket needs to be inserted in the incomplete connection queue
until it is actually ready to be used.

Reported by: glebius (and kris)
Tested by: glebius


# 1e8f5ffa 17-Jan-2008 Robert Watson <rwatson@FreeBSD.org>

In tcp_ctloutput(), don't hold the inpcb lock over sooptcopyin(), rather,
drop the lock and then re-acquire it, revalidating TCP connection state
assumptions when we do so. This avoids a potential lock order reversal
(and potential deadlock, although none have been reported) due to the
inpcb lock being held over a page fault.

MFC after: 1 week
PR: 102752
Reviewed by: bz
Reported by: Václav Haisman <v dot haisman at sh dot cvut dot cz>


# bc65987a 18-Dec-2007 Kip Macy <kmacy@FreeBSD.org>

Incorporate TCP offload hooks in to core TCP code.
- Rename output routines tcp_gen_* -> tcp_output_*.
- Rename notification routines that turn in to no-ops in the absence of TOE
from tcp_gen_* -> tcp_offload_*.
- Fix some minor comment nits.
- Add a /* FALLTHROUGH */

Reviewed by: Sam Leffler, Robert Watson, and Mike Silbersack


# 9b3bc6bf 19-Oct-2007 Mike Silbersack <silby@FreeBSD.org>

Pick the smallest possible TCP window scaling factor that will still allow
us to scale up to sb_max, aka kern.ipc.maxsockbuf.

We do this because there are broken firewalls that will corrupt the window
scale option, leading to the other endpoint believing that our advertised
window is unscaled. At scale factors larger than 5 the unscaled window will
drop below 1500 bytes, leading to serious problems when traversing these
broken firewalls.

With the default maxsockbuf of 256K, a scale factor of 3 will be chosen by
this algorithm. Those who choose a larger maxsockbuf should watch out
for the compatiblity problems mentioned above.

Reviewed by: andre


# 4b421e2d 07-Oct-2007 Mike Silbersack <silby@FreeBSD.org>

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# e2f2059f 23-Sep-2007 Mike Silbersack <silby@FreeBSD.org>

Two changes:

- Reintegrate the ANSI C function declaration change
from tcp_timer.c rev 1.92

- Reorganize the tcpcb structure so that it has a single
pointer to the "tcp_timer" structure which contains all
of the tcp timer callouts. This change means that when
the single tcp timer change is reintegrated, tcpcb will
not change in size, and therefore the ABI between
netstat and the kernel will not change.

Neither of these changes should have any functional
impact.

Reviewed by: bmah, rrs
Approved by: re (bmah)


# 85d94372 07-Sep-2007 Robert Watson <rwatson@FreeBSD.org>

Back out tcp_timer.c:1.93 and associated changes that reimplemented the many
TCP timers as a single timer, but retain the API changes necessary to
reintroduce this change. This will back out the source of at least two
reported problems: lock leaks in certain timer edge cases, and TCP timers
continuing to fire after a connection has closed (a bug previously fixed and
then reintroduced with the timer rewrite).

In a follow-up commit, some minor restylings and comment changes performed
after the TCP timer rewrite will be reapplied, and a further change to allow
the TCP timer rewrite to be added back without disturbing the ABI. The new
design is believed to be a good thing, but the outstanding issues are
leading to significant stability/correctness problems that are holding
up 7.0.

This patch was generated by silby, but is being committed by proxy due to
poor network connectivity for silby this week.

Approved by: re (kensmith)
Submitted by: silby
Tested by: rwatson, kris
Problems reported by: peter, kris, others


# 218cbbea 30-Jul-2007 Dag-Erling Smørgrav <des@FreeBSD.org>

Make tcpstates[] static, and make sure TCPSTATES is defined before
<netinet/tcp_fsm.h> is included into any compilation unit that needs
tcpstates[]. Also remove incorrect extern declarations and TCPDEBUG
conditionals. This allows kernels both with and without TCPDEBUG to
build, and unbreaks the tinderbox.

Approved by: re (rwatson)


# 24face54 28-Jul-2007 Matt Jacob <mjacob@FreeBSD.org>

Fix compilation problems- tcpstates is only available if TCPDEBUG
is set.

Approved by: re (in spirit)


# 3c010a41 15-Jun-2007 Matt Jacob <mjacob@FreeBSD.org>

Garbage collect some debug code that not only no longer could
work but in fact probably causes a random pointer dereferences.
Garbage collect the tp variable too.


# abc7d910 30-May-2007 Robert Watson <rwatson@FreeBSD.org>

(1) In tcp_usrclosed(), tp can never become NULL, so don't test for NULL
before handling the socket disconnection case.

(2) Clean up surrounding comments and formatting.

Found with: Coverity Prevent(tm) (1)
CID: 2203


# 54d642bb 11-May-2007 Robert Watson <rwatson@FreeBSD.org>

Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr
protocol entry points using functions named proto_getsockaddr and
proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr.
While it's true that sockaddrs are allocated and set, the net effect is
to retrieve (get) the socket address or peer address from a socket, not
set it, so align names to that intent.


# 169db7b2 11-May-2007 Robert Watson <rwatson@FreeBSD.org>

Remove unneeded wrappers for in_setsockaddr() and in_setpeeraddr(), which
used to exist so pcbinfo locks could be acquired, but are no longer
required as a result of socket/pcb reference model refinements.


# f2565d68 10-May-2007 Robert Watson <rwatson@FreeBSD.org>

Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.


# 1a553740 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

Remove unused requested_s_scale from struct tcpcb.


# 3529149e 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool. Change all users accordingly.


# 84ca8aa6 01-May-2007 Robert Watson <rwatson@FreeBSD.org>

Remove unused pcbinfo arguments to in_setsockaddr() and
in_setpeeraddr().


# b8152ba7 11-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by: rwatson (earlier version)


# ad3f9ab3 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.


# e406f5a1 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code. Should
the problem surface in the future we can simply resurrect it from cvs
history.


# 7c72af87 26-Feb-2007 Mohan Srinivasan <mohans@FreeBSD.org>

Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.


# 497057ee 17-Feb-2007 Robert Watson <rwatson@FreeBSD.org>

Add "show inpcb", "show tcpcb" DDB commands, which should come in handy
for debugging sblock and other network panics.


# 1baaf834 02-Feb-2007 Bruce M Simpson <bms@FreeBSD.org>

Expose smoothed RTT and RTT variance measurements to userland via
socket option TCP_INFO.
Note that the units used in the original Linux API are in microseconds,
so use a 64-bit mantissa to convert FreeBSD's internal measurements
from struct tcpcb from ticks.


# 6741ecf5 01-Feb-2007 Andre Oppermann <andre@FreeBSD.org>

Auto sizing TCP socket buffers.

Normally the socket buffers are static (either derived from global
defaults or set with setsockopt) and do not adapt to real network
conditions. Two things happen: a) your socket buffers are too small
and you can't reach the full potential of the network between both
hosts; b) your socket buffers are too big and you waste a lot of
kernel memory for data just sitting around.

With automatic TCP send and receive socket buffers we can start with a
small buffer and quickly grow it in parallel with the TCP congestion
window to match real network conditions.

FreeBSD has a default 32K send socket buffer. This supports a maximal
transfer rate of only slightly more than 2Mbit/s on a 100ms RTT
trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send
buffer auto scaling and the default values below it supports 20Mbit/s
at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or
1000%. For the receive side it looks slightly better with a default of
64K buffer size.

New sysctls are:
net.inet.tcp.sendbuf_auto=1 (enabled)
net.inet.tcp.sendbuf_inc=8192 (8K, step size)
net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
net.inet.tcp.recvbuf_auto=1 (enabled)
net.inet.tcp.recvbuf_inc=16384 (16K, step size)
net.inet.tcp.recvbuf_max=262144 (256K, growth limit)

Tested by: many (on HEAD and RELENG_6)
Approved by: re
MFC after: 1 month


# 087b55ea 01-Feb-2007 Andre Oppermann <andre@FreeBSD.org>

Change the way the advertized TCP window scaling is computed. Instead of
upper-bounding it to the size of the initial socket buffer lower-bound it
to the smallest MSS we accept. Ideally we'd use the actual MSS information
here but it is not available yet.

For socket buffer auto sizing to be effective we need room to grow the
receive window. The window scale shift is determined at connection setup
and can't be changed afterwards. The previous, original, method effectively
just did a power of two roundup of the socket buffer size at connection
setup severely limiting the headroom for larger socket buffers.

Tested by: many (as part of the socket buffer auto sizing patch)
MFC after: 1 month


# 21367f63 22-Nov-2006 Sam Leffler <sam@FreeBSD.org>

Change error codes returned by protocol operations when an inpcb is
marked INP_DROPPED or INP_TIMEWAIT:
o return ECONNRESET instead of EINVAL for close, disconnect, shutdown,
rcvd, rcvoob, and send operations
o return ECONNABORTED instead of EINVAL for accept

These changes should reduce confusion in applications since EINVAL is
normally interpreted to mean an invalid file descriptor. This change
does not conflict with POSIX or other standards I checked. The return
of EINVAL has always been possible but rare; it's become more common
with recent changes to the socket/inpcb handling and with finer-grained
locking and preemption.

Note: there are other instances of EINVAL for this state that were
left unchanged; they should be reviewed.

Reviewed by: rwatson, andre, ru
MFC after: 1 month


# 7ff0b850 17-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Make tcp_usr_send() free the passed mbufs on error in all cases as the
comment to it claims.

Sponsored by: TCP/IP Optimization Fundraise 2005


# a152f8a3 21-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket. pru_abort is now a
notification of close also, and no longer detaches. pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket. This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree(). With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by: gnn


# d915b280 18-Jul-2006 Stephan Uphoff <ups@FreeBSD.org>

Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by: rwatson@, mohans@
MFC after: 1 week


# b4470c16 26-Jun-2006 Robert Watson <rwatson@FreeBSD.org>

In tcp6_usr_attach(), return immediately if SS_ISDISCONNECTED, to
avoid dereferencing an uninitialized inp variable.

Submitted by: Michiel Boland <michiel at boland dot org>
MFC after: 1 month


# f2de87fe 04-Jun-2006 Robert Watson <rwatson@FreeBSD.org>

Push acquisition of pcbinfo lock out of tcp_usr_attach() into
tcp_attach() after the call to soreserve(), as it doesn't require
the global lock. Rearrange inpcb locking here also.

MFC after: 1 month


# c78cbc7b 24-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Instead of calling tcp_usr_detach() from tcp_usr_abort(), break out
common pcb tear-down logic into tcp_detach(), which is called from
either. Invoke tcp_drop() from the tcp_usr_abort() path rather than
tcp_disconnect(), as we want to drop it immediately not perform a
FIN sequence. This is one reason why some people were experiencing
panics in sodealloc(), as the netisr and aborting thread were
simultaneously trying to tear down the socket. This bug could often
be reproduced using repeated runs of the listenclose regression test.

MFC after: 3 months
PR: 96090
Reported by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris
Tested by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris


# e6e65783 02-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Clarify comment on handling of non-timewait TCP states in
tcp_usr_detach().

MFC after: 3 months


# 3d2d3ef4 03-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

After checking for SO_ISDISCONNECTED in tcp_usr_accept(), return
immediately rather than jumping to the normal output handling, which
assumes we've pulled out the inpcb, which hasn't happened at this
point (and isn't necessary).

Return ECONNABORTED instead of EINVAL when the inpcb has entered
INP_TIMEWAIT or INP_DROPPED, as this is the documented error value.

This may correct the panic seen by Ganbold.

MFC after: 1 month
Reported by: Ganbold <ganbold at micom dot mng dot net>


# 953b5606 02-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

During reformulation of tcp_usr_detach(), the call to initiate TCP
disconnect for fully connected sockets was dropped, meaning that if
the socket was closed while the connection was alive, it would be
leaked. Structure tcp_usr_detach() so that there are two clear
parts: initiating disconnect, and reclaiming state, and reintroduce
the tcp_disconnect() call in the first part.

MFC after: 3 months


# 34af7bae 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Properly handle an edge case previously not handled correctly: a
socket can have a tcp connection that has entered time wait
attached to it, in the event that shutdown() is called on the
socket and the FINs properly exchange before close(). In this
case we don't detach or free the inpcb, just leave the tcptw
detached and freed, but we must release the inpcb lock (which we
didn't previously).

MFC after: 3 months


# 623dce13 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.

- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.

- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after: 3 months


# bc725eaf 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Chance protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.

MFC after: 3 months


# ac45e92f 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful. Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit. This will be corrected shortly in followup
commits to these components.

MFC after: 3 months


# e59898ff 14-Dec-2005 Maxime Henrion <mux@FreeBSD.org>

Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to
match the type of the variable they are exporting.

Spotted by: Thomas Hurst <tom@hur.st>
MFC after: 3 days


# d374e81e 30-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN. This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit. This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface. This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.


# ef8fd904 23-Aug-2005 Andre Oppermann <andre@FreeBSD.org>

Remove unnecessary IPSEC includes.

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005


# a1f7e5f8 24-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

scope cleanup. with this change
- most of the kernel code will not care about the actual encoding of
scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
scoped addresses as a special case.
- scope boundary check will be stricter. For example, the current
*BSD code allows a packet with src=::1 and dst=(some global IPv6
address) to be sent outside of the node, if the application do:
s = socket(AF_INET6);
bind(s, "::1");
sendto(s, some_global_IPv6_addr);
This is clearly wrong, since ::1 is only meaningful within a single
node, but the current implementation of the *BSD kernel cannot
reject this attempt.

Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from: KAME


# 30393994 31-May-2005 Robert Watson <rwatson@FreeBSD.org>

When aborting tcp_attach() due to a problem allocating or attaching the
tcpcb, lock the inpcb before calling in_pcbdetach() or in6_pcbdetach(),
as they expect the inpcb to be passed locked.

MFC after: 7 days


# e6e0b5ff 31-May-2005 Robert Watson <rwatson@FreeBSD.org>

Assert tcbinfo lock, inpcb lock in tcp_disconnect().
Assert tcbinfo lock, inpcb lock in in tcp_usrclosed().

MFC after: 7 days


# 7609aad7 01-Jun-2005 Robert Watson <rwatson@FreeBSD.org>

Assert tcbinfo lock in tcp_attach(), as it is required; the caller
(tcp_usr_attach()) currently grabs it.

MFC after: 7 days


# 2cdbfa66 20-May-2005 Paul Saab <ps@FreeBSD.org>

Replace t_force with a t_flag (TF_FORCEDATA).

Submitted by: Raja Mukerji.
Reviewed by: Mohan, Silby, Andre Opperman.


# b60d26c9 01-May-2005 Robert Watson <rwatson@FreeBSD.org>

Remove now unused inirw variable from previous use of COMMON_END().

Reported by: csjp


# 73fddeda 01-May-2005 Peter Grehan <grehan@FreeBSD.org>

Fix typo in last commit.

Approved by: rwatson


# d1401c90 01-May-2005 Robert Watson <rwatson@FreeBSD.org>

Slide unlocking of the tcbinfo lock earlier in tcp_usr_send(), as it's
needed only for implicit connect cases. Under load, especially on SMP,
this can greatly reduce contention on the tcbinfo lock.

NB: Ambiguities about the state of so_pcb need to be resolved so that
all use of the tcbinfo lock in non-implicit connection cases can be
eliminated.

Submited by: Kazuaki Oda <kaakun at highway dot ne dot jp>


# 812d8653 28-Mar-2005 Sam Leffler <sam@FreeBSD.org>

eliminate extraneous null ptr checks

Noticed by: Coverity Prevent analysis tool


# d2bc35ab 14-Mar-2005 Robert Watson <rwatson@FreeBSD.org>

In tcp_usr_send(), broaden coverage of the socket buffer lock in the
non-OOB case so that the sbspace() check is performed under the same
lock instance as the append to the send socket buffer.

MFC after: 1 week


# 0daccb9c 21-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.

This change does the following:

- Pushes the socket state transition from the socket layer solisten() to
to socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().

- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.

This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely require
elsewhere in the socket/protocol code.

Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn


# 9945c0e2 14-Feb-2005 Maxim Konovalov <maxim@FreeBSD.org>

o Add handling of an IPv4-mapped IPv6 address.
o Use SYSCTL_IN() macro instead of direct call of copyin(9).

Submitted by: ume

o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no needs in
a separate struct tcp_ident_mapping.

Suggested by: ume


# 212a79b0 06-Feb-2005 Maxim Konovalov <maxim@FreeBSD.org>

o Implement net.inet.tcp.drop sysctl and userland part, tcpdrop(8)
utility:

The tcpdrop command drops the TCP connection specified by the
local address laddr, port lport and the foreign address faddr,
port fport.

Obtained from: OpenBSD
Reviewed by: rwatson (locking), ru (man page), -current
MFC after: 1 month


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# c8443a1d 27-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Do export the advertised receive window via the tcpi_rcv_space field of
struct tcp_info.


# b8af5dfa 26-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Implement parts of the TCP_INFO socket option as found in Linux 2.6.
This socket option allows processes query a TCP socket for some low
level transmission details, such as the current send, bandwidth, and
congestion windows. Linux provides a 'struct tcpinfo' structure
containing various variables, rather than separate socket options;
this makes the API somewhat fragile as it makes it dificult to add
new entries of interest as requirements and implementation evolve.
As such, I've included a large pad at the end of the structure.
Right now, relatively few of the Linux API fields are filled in, and
some contain no logical equivilent on FreeBSD. I've include __'d
entries in the structure to make it easier to figure ou what is and
isn't omitted. This API/ABI should be considered unstable for the
time being.


# 756d52a1 08-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize struct pr_userreqs in new/sparse style and fill in common
default elements in net_init_domain().

This makes it possible to grep these structures and see any bogosities.


# c94c54e4 02-Nov-2004 Andre Oppermann <andre@FreeBSD.org>

Remove RFC1644 T/TCP support from the TCP side of the network stack.

A complete rationale and discussion is given in this message
and the resulting discussion:

http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on: -arch


# a4f757cd 16-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>


# 1f44b0a1 14-Aug-2004 David Malone <dwmalone@FreeBSD.org>

Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD
have already done this, so I have styled the patch on their work:

1) introduce a ip_newid() static inline function that checks
the sysctl and then decides if it should return a sequential
or random IP ID.

2) named the sysctl net.inet.ip.random_id

3) IPv6 flow IDs and fragment IDs are now always random.
Flow IDs and frag IDs are significantly less common in the
IPv6 world (ie. rarely generated per-packet), so there should
be smaller performance concerns.

The sysctl defaults to 0 (sequential IP IDs).

Reviewed by: andre, silby, mlaier, ume
Based on: NetBSD
MFC after: 2 months


# 0aa8ce50 26-Jul-2004 John-Mark Gurney <jmg@FreeBSD.org>

compare pointer against NULL, not 0

when inpcb is NULL, this is no longer invalid since jlemon added the
tcp_twstart function... this prevents close "failing" w/ EINVAL when it
really was successful...

Reviewed by: jeremy (NetBSD)


# 8a59da30 16-Jul-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

when IN6P_AUTOFLOWLABEL is set, the flowlabel is not set on
outgoing tcp connections.

Reported by: Orla McGann <orly@cnri.dit.ie>
Reviewed by: Orla McGann <orly@cnri.dit.ie>
Obtained from: KAME


# 3f9d1ef9 26-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Remove spl's from TCP protocol entry points. While not all locking
is merged here yet, this will ease the merge process by bringing the
locked and unlocked versions into sync.


# 4e397bc5 18-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

In tcp_ctloutput(), don't hold the inpcb lock over a call to
ip_ctloutput(), as it may need to perform blocking memory allocations.
This also improves consistency with locking relative to other points
that call into ip_ctloutput().

Bumped into by: Grover Lines <grover@ceribus.net>


# c0b99ffa 14-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 52710de1 04-Apr-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix a panic possibility caused by returning without releasing locks.
It was fixed by moving problemetic checks, as well as checks that
doesn't need locking before locks are acquired.

Submitted by: Ryan Sommers <ryans@gamersimpact.com>
In co-operation with: cperciva, maxim, mlaier, sam
Tested by: submitter (previous patch), me (current patch)
Reviewed by: cperciva, mlaier (previous patch), sam (current patch)
Approved by: sam
Dedicated to: enough!


# 56dc72c3 28-Mar-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove unused argument.


# b0330ed9 27-Mar-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Reduce 'td' argument to 'cred' (struct ucred) argument in those functions:
- in_pcbbind(),
- in_pcbbind_setup(),
- in_pcbconnect(),
- in_pcbconnect_setup(),
- in6_pcbbind(),
- in6_pcbconnect(),
- in6_pcbsetport().
"It should simplify/clarify things a great deal." --rwatson

Requested by: rwatson
Reviewed by: rwatson, ume


# 6823b823 27-Mar-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove unused argument.

Reviewed by: ume


# 88f6b043 16-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Shorten the name of the socket option used to enable TCP-MD5 packet
treatment.

Submitted by: Vincent Jardin


# 265ed012 13-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Brucification.

Submitted by: bde


# 1cfd4b53 10-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


# e29ef13f 10-Jan-2004 Don Lewis <truckman@FreeBSD.org>

Check that sa_len is the appropriate value in tcp_usr_bind(),
tcp6_usr_bind(), tcp_usr_connect(), and tcp6_usr_connect() before checking
to see whether the address is multicast so that the proper errno value
will be returned if sa_len is incorrect. The checks are identical to the
ones in in_pcbbind_setup(), in6_pcbbind(), and in6_pcbladdr(), which are
called after the multicast address check passes.

MFC after: 30 days


# 53369ac9 08-Jan-2004 Andre Oppermann <andre@FreeBSD.org>

Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.

For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limites of network components along the path.

This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).

We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.

o the local host is reveiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may chose to do what is described in the first attack and
send the data in packets with an TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.

For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).

This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.

We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is resetted and dropped.

MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day


# 5bd311a5 25-Nov-2003 Sam Leffler <sam@FreeBSD.org>

Split the "inp" mutex class into separate classes for each of divert,
raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness
complaints.

Reviewed by: rwatson
Approved by: re (rwatson)


# 97d8d152 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# a557af22 17-Nov-2003 Robert Watson <rwatson@FreeBSD.org>

Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 395bb186 27-Oct-2003 Sam Leffler <sam@FreeBSD.org>

speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by: ps/jayanth
Obtained from: netbsd (jason thorpe)


# a3b6edc3 08-Mar-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Remove check for t_state == TCPS_TIME_WAIT and introduce the tw structure.

Sponsored by: DARPA, NAI Labs


# edf02ff1 24-Feb-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Hold the TCP protocol lock while modifying the connection hash table.


# efac726e 23-Oct-2002 Ian Dowse <iedowse@FreeBSD.org>

Unbreak the automatic remapping of an INADDR_ANY destination address
to the primary local IP address when doing a TCP connect(). The
tcp_connect() code was relying on in_pcbconnect (actually in_pcbladdr)
modifying the passed-in sockaddr, and I failed to notice this in
the recent change that added in_pcbconnect_setup(). As a result,
tcp_connect() was ending up using the unmodified sockaddr address
instead of the munged version.

There are two cases to handle: if in_pcbconnect_setup() succeeds,
then the PCB has already been updated with the correct destination
address as we pass it pointers to inp_faddr and inp_fport directly.
If in_pcbconnect_setup() fails due to an existing but dead connection,
then copy the destination address from the old connection.


# 5200e00e 21-Oct-2002 Ian Dowse <iedowse@FreeBSD.org>

Replace in_pcbladdr() with a more generic inner subroutine for
in_pcbconnect() called in_pcbconnect_setup(). This version performs
all of the functions of in_pcbconnect() except for the final
committing of changes to the PCB. In the case of an EADDRINUSE error
it can also provide to the caller the PCB of the duplicate connection,
avoiding an extra in_pcblookup_hash() lookup in tcp_connect().

This change will allow the "temporary connect" hack in udp_output()
to be removed and is part of the preparation for adding the
IP_SENDSRCADDR control message.

Discussed on: -net
Approved by: re


# 4a6a94d8 22-Aug-2002 Archie Cobbs <archie@FreeBSD.org>

Replace (ab)uses of "NULL" where "0" is really meant.


# 26ef6ac4 21-Aug-2002 Don Lewis <truckman@FreeBSD.org>

Create new functions in_sockaddr(), in6_sockaddr(), and
in6_v4mapsin6_sockaddr() which allocate the appropriate sockaddr_in*
structure and initialize it with the address and port information passed
as arguments. Use calls to these new functions to replace code that is
replicated multiple times in in_setsockaddr(), in_setpeeraddr(),
in6_setsockaddr(), in6_setpeeraddr(), in6_mapped_sockaddr(), and
in6_mapped_peeraddr(). Inline COMMON_END in tcp_usr_accept() so that
we can call in_sockaddr() with temporary copies of the address and port
after the PCB is unlocked.

Fix the lock violation in tcp6_usr_accept() (caused by calling MALLOC()
inside in6_mapped_peeraddr() while the PCB is locked) by changing
the implementation of tcp6_usr_accept() to match tcp_usr_accept().

Reviewed by: suz


# 1fcc99b5 17-Aug-2002 Matthew Dillon <dillon@FreeBSD.org>

Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit

MFC after: 1 week


# d46a5312 29-Jul-2002 Maxim Konovalov <maxim@FreeBSD.org>

Use a common way to release locks before exit.

Reviewed by: hsu


# 66ef17c4 25-Jul-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

make setsockopt(IPV6_V6ONLY, 0) actuall work for tcp6.

MFC after: 1 week


# eccb7001 25-Jul-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

cleanup usage of ip6_mapped_addr_on and ip6_v6only. now,
ip6_mapped_addr_on is unified into ip6_v6only.

MFC after: 1 week


# 9c68f33a 13-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Because we're holding an exclusive write lock on the head, references to
the new inp cannot leak out even though it has been placed on the head list.


# f76fcf6d 10-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Lock up inpcb.

Submitted by: Jennifer Yang <yangjihui@yahoo.com>


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# c1cd65ba 24-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs in the removal of __P(()). Continuation lines
were not outdented to preserve non-KNF lining up of code with parentheses.
Switch to KNF formatting.


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# b7d6d952 28-Feb-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

- Set inc_isipv6 in tcp6_usr_connect().
- When making a pcb from a sync cache, do not forget to copy inc_isipv6.

Obtained from: KAME
MFC After: 1 week


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# be2ac88c 21-Nov-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# b0e3ad75 21-Aug-2001 Mike Silbersack <silby@FreeBSD.org>

Much delayed but now present: RFC 1948 style sequence numbers

In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.

Reviewed by: jesper


# 13cf67f3 26-Jul-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

move ipsec security policy allocation into in_pcballoc, before
making pcbs available to the outside world. otherwise, we will see
inpcb without ipsec security policy attached (-> panic() in ipsec.c).

Obtained from: KAME
MFC after: 3 days


# 81e561cd 13-Jul-2001 David E. O'Brien <obrien@FreeBSD.org>

Bump net.inet.tcp.sendspace to 32k and net.inet.tcp.recvspace to 65k.
This should help us in nieve benchmark "tests".

It seems a wide number of people think 32k buffers would not cause major
issues, and is in fact in use by many other OS's at this time. The
receive buffers can be bumped higher as buffers are hardly used and several
research papers indicate that receive buffers rarely use much space at all.

Submitted by: Leo Bicknell <bicknell@ufp.org>
<20010713101107.B9559@ussenterprise.ufp.org>
Agreed to in principle by: dillon (at the 32k level)


# 2d610a50 07-Jul-2001 Mike Silbersack <silby@FreeBSD.org>

Temporary feature: Runtime tuneable tcp initial sequence number
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.

While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.

To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.

Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.

Reviewed by: jlemon
Tested by: numerous subscribers of -net


# 08517d53 22-Jun-2001 Mike Silbersack <silby@FreeBSD.org>

Eliminate the allocation of a tcp template structure for each
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.

Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# d1745f45 20-Apr-2001 Jesper Skriver <jesper@FreeBSD.org>

Say goodbye to TCP_COMPAT_42

Reviewed by: wollman
Requested by: wollman


# f0a04f3f 17-Apr-2001 Kris Kennaway <kris@FreeBSD.org>

Randomize the TCP initial sequence numbers more thoroughly.

Obtained from: OpenBSD
Reviewed by: jesper, peter, -developers


# 1db24ffb 11-Mar-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Unbreak LINT.

Pointed out by: phk


# c0647e0d 09-Mar-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Push the test for a disconnected socket when accept()ing down to the
protocol layer. Not all protocols behave identically. This fixes the
brokenness observed with unix-domain sockets (and postfix)


# 91421ba2 20-Feb-2001 Robert Watson <rwatson@FreeBSD.org>

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project


# 007581c0 02-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

When turning off TCP_NOPUSH, call tcp_output to immediately flush
out any data pending in the buffer.

Submitted by: Tony Finch <dot@dotat.at>


# fdaf052e 01-Apr-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Support per socket based IPv4 mapped IPv6 addr enable/disable control.

Submitted by: ume


# fb59c426 09-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 6a800098 22-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 79ea3cf1 12-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

Always set INP_IPV4 flag for IPv4 pcb entries, because netstat needs it
to print out protocol specific pcb info.

A patch submitted by guido@gvr.org, and asmodai@wxs.nl also reported
the problem.
Thanks and sorry for your troubles.

Submitted by: guido@gvr.org
Reviewed by: shin


# cfa1ca9d 07-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translater daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 45d3a132 18-Nov-1999 Peter Wemm <peter@FreeBSD.org>

Fix a warning and a potential panic if TCPDEBUG is active. (tp is
a wild pointer and used by TCPDEBUG2())


# 9b8b58e0 30-Aug-1999 Jonathan Lemon <jlemon@FreeBSD.org>

Restructure TCP timeout handling:

- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.

Submitted by: jlemon, wollmann


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 9c9906e9 03-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Plug a mbuf leak in tcp_usr_send(). pru_send() routines are expected
to either enqueue or free their mbuf chains, but tcp_usr_send() was
dropping them on the floor if the tcpcb/inpcb has been torn down in the
middle of a send/write attempt. This has been responsible for a wide
variety of mbuf leak patterns, ranging from slow gradual leakage to rather
rapid exhaustion. This has been a problem since before 2.2 was branched
and appears to have been fixed in rev 1.16 and lost in 1.23/1.28.

Thanks to Jayanth Vijayaraghavan <jayanth@yahoo-inc.com> for checking
(extensively) into this on a live production 2.2.x system and that it
was the actual cause of the leak and looks like it fixes it. The machine
in question was loosing (from memory) about 150 mbufs per hour under
load and a change similar to this stopped it. (Don't blame Jayanth
for this patch though)

An alternative approach to this would be to recheck SS_CANTSENDMORE etc
inside the splnet() right before calling pru_send() after all the potential
sleeps, interrupts and delays have happened. However, this would mean
exposing knowledge of the tcp stack's reset handling and removal of the
pcb to the generic code. There are other things that call pru_send()
directly though.

Problem originally noted by: John Plevyak <jplevyak@inktomi.com>


# 3d177f46 03-May-1999 Bill Fumerola <billf@FreeBSD.org>

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 75c13541 28-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# 3879597f 24-Apr-1999 Andrey A. Chernov <ache@FreeBSD.org>

so_linger is in seconds, not in 1/HZ

PR: 11252
Submitted by: Martin Kammerhofer <dada@sbox.tu-graz.ac.at>


# b0acefa8 20-Jan-1999 Bill Fenner <fenner@FreeBSD.org>

Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This
flag means that there is more data to be put into the socket buffer.
Use it in TCP to reduce the interaction between mbuf sizes and the
Nagle algorithm.

Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's
fix for this problem.


# f1d19042 07-Dec-1998 Archie Cobbs <archie@FreeBSD.org>

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


# cfe8b629 22-Aug-1998 Garrett Wollman <wollman@FreeBSD.org>

Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.


# c3229e05 27-Jan-1998 David Greenman <dg@FreeBSD.org>

Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).


# 744f87ea 18-Dec-1997 David Greenman <dg@FreeBSD.org>

Fixed a missing splx(s) bug in tcp_usr_send().


# 0cc12cc5 16-Sep-1997 Joerg Wunsch <joerg@FreeBSD.org>

Make TCPDEBUG a new-style option.


# f8f6cbba 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

Update network code to use poll support.


# 57bf258e 16-Aug-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 1fd0b058 02-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# a29f300e 27-Apr-1997 Garrett Wollman <wollman@FreeBSD.org>

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# ef53690b 21-Feb-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix potential crash where a user attempts to perform an implied
connect in TCP while sending urgent data. It is not clear what
purpose is served by doing this, but there's no good reason why it
shouldn't work.

Submitted by: tjevans@raleigh.ibm.com via wpaul


# 117bcae7 18-Feb-1997 Garrett Wollman <wollman@FreeBSD.org>

Convert raw IP from mondo-switch-statement-from-Hell to
pr_usrreqs. Collapse duplicates with udp_usrreq.c and
tcp_usrreq.c (calling the generic routines in uipc_socket2.c and
in_pcb.c). Calling sockaddr()_ or peeraddr() on a detached
socket now traps, rather than harmlessly returning an error; this
should never happen. Allow the raw IP buffer sizes to be
controlled via sysctl.


# d0390e05 14-Feb-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix the mechanism for choosing wehether to save the slow-start threshold
in the route. This allows us to remove the unconditional setting of the
pipesize in the route, which should mean that SO_SNDBUF and SO_RCVBUF
should actually work again. While we're at it:

- Convert udp_usrreq from `mondo switch statement from Hell' to new-style.
- Delete old TCP mondo switch statement from Hell, which had previously
been diked out.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 6d6a026b 07-Oct-1996 David Greenman <dg@FreeBSD.org>

Improved in_pcblookuphash() to support wildcarding, and changed relavent
callers of it to take advantage of this. This reduces new connection
request overhead in the face of a large number of PCBs in the system.
Thanks to David Filo <filo@yahoo.com> for suggesting this and providing
a sample implementation (which wasn't used, but showed that it could be
done).

Reviewed by: wollman


# 7b40aa32 13-Sep-1996 Paul Traina <pst@FreeBSD.org>

Make the misnamed tcp initial keepalive timer value (which is really the
time, in seconds, that state for non-established TCP sessions stays about)
a sysctl modifyable variable.

[part 1 of two commits, I just realized I can't play with the indices as
I was typing this commit message.]


# af7a2999 12-Jul-1996 David Greenman <dg@FreeBSD.org>

Fixed two bugs in previous commit: be sure to include tcp_debug.h when
TCPDEBUG is defined, and fix typo in TCPDEBUG2() macro.


# 2c37256e 11-Jul-1996 Garrett Wollman <wollman@FreeBSD.org>

Modify the kernel to use the new pr_usrreqs interface rather than the old
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.


# 2ee45d7d 11-Mar-1996 David Greenman <dg@FreeBSD.org>

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# 2baeef32 06-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Removed unnecessary #includes of vm stuff. Most of them were once
prerequisites for <sys/sysctl.h>.

subr_prof.c:
Also replaced #include of <sys/user.h> by #include of <sys/resourcevar.h>.


# 0312fbe9 14-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

New style sysctl & staticize alot of stuff.


# 98163b98 09-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Start adding new style sysctl here too.


# a45d2726 03-Nov-1995 Andras Olah <olah@FreeBSD.org>

Fix a logical error in T/TCP: when we actively open a connection, we
have to decide whether to send a CC or CCnew option in our SYN segment
depending on the contents of our TAO cache. This decision has to be
made once when the connection starts. The earlier code delayed this
decision until the segment was assembled in tcp_output() and
retransmitted SYN segments could have different CC options.

Reviewed by: Richard Stevens, davidg, wollman


# b6239c4a 29-Oct-1995 Andras Olah <olah@FreeBSD.org>

Start the 2MSL timer when the socket is closed and the TCP connection is
in the FIN_WAIT_2 state in order to prevent the conn. hanging there
forever.

Reviewed by: davidg, olah
Submitted by: Arne Henrik Juul <arnej@imf.unit.no>
Obtained from: bugs@netbsd.org


# efe4b0eb 21-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Second try: get 4.4-Lite-2 into the source tree. The conflicts don't
matter because none of our working source files are on the CSRG branch
any more.

Obtained from: 4.4BSD-Lite-2


# b6e3d50f 13-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Don't leak mbufs in an unusual error case in tcp_usrreq().

Reviewed by: Andras Olah <olah@freebsd.org>
Obtained from: Lite-2


# d3628763 11-Jun-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Merge RELENG_2_0_5 into HEAD


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 15bd2b43 08-Apr-1995 David Greenman <dg@FreeBSD.org>

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# c7a82f90 16-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Include missing <sys/kernel.h> for `hz'.

Submitted by: David Greenman, Rod Grimes, Christoph Kukulies


# 1fdbc7ae 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Correctly initialize so_linger in ticks (not seconds).

Obtained from: Stevens, vol. 2, p. 1010


# 41f82abe 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Transaction TCP support now standard. Hack away!


# f2ea20e6 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Add lots of useful MIB variables and a few not-so-useful ones for
completeness.


# a0292f23 09-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and
Bob Braden <braden@isi.edu>.

NB: This has not had David's TCP ACK hack re-integrated. It is not clear
what the correct solution to this problem is, if any. If a better solution
doesn't pop up in response to this message, I'll put David's code back in
(or he's welcome to do so himself).


# 9ee39fc6 15-Dec-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix PR 59: don't allow TCP connections withmulticast addresses at either
end.


# 610ee2f9 15-Sep-1994 David Greenman <dg@FreeBSD.org>

Made TCPDEBUG truely optional. Based on changes I made in FreeBSD 1.1.5.
Fixed somebody's idea of a joke - about the first half of the lines in
in_proto.c were spaced over by one space.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26e30fbb 29-May-1994 David Greenman <dg@FreeBSD.org>

Increased tcp_send/recvspace to 16k, and added TCP_SMALLSPACE ifdef
to set it to 4k.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources