History log of /freebsd-current/sys/netinet/tcp_output.c
Revision Date Author Comments
# 2a9aae9e 08-May-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: add counter to track when SACK loss recovery uses TSO

Add a counter to track how frequently SACK has transmitted
more than one MSS using TSO. Instances when this will be
beneficial is the use of PRR, or when ACK thinning due to
GRO/LRO or ACK discards by the network are present.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45070


# dcdfe449 08-May-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: add sysctl to allow/disallow TSO during SACK loss recovery

Introduce net.inet.tcp.sack.tso for future use when TSO is ready
to be used during loss recovery.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45068


# af700f43 22-Mar-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: no data on SYN segments unless doing TFO

Ensure that there is no data on SYN segments unless doing TFO.
This check is already in RACK and BBR.

Reported by: glebius
Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44384


# dd7b86e2 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove IS_FASTOPEN() macro

The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.

Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362


# a8e817cf 10-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: stop doing superfluous work after sending RST

When sending a RST control segment in tcp_output() it
means we are in TCPS_CLOSED state, called from tcp_drop().
Once the RST is sent, don't call tcp_timer_activate() or
update anything in tcpcb, since that will go away shortly.

PR: 276761
Provided by: glebius
Reviewed By: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43808


# 2d05a1c8 25-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: commonize check for more data to send, style changes

Use SEQ_SUB instead of a plain subtraction, for an implict
type conversion and prevention of a possible overflow.
Use curly brackets in stacked if statements throughout.
Use of the ? operator to enhance readability when clearing
the FIN flag in tcp_output().

None of the above change the function.

Reviewed By: tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43539


# 0932fb56 25-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: fix TCPSTAT accounting for SACK

Account for SACK retransmitted bytes once the actual length
is known. This prevents a call to tcp_maxseg() and prepares
for TSO support when transmitting from the SACK scoreboard.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43447


# 90ad2dc2 23-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove 20+ year old disabled code from d912c694ee00


# c809435b 23-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: clear outdated comment mentioning T/TCP


# 429f14f8 08-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: clean PRR state after ECN congestion recovery.

PRR state was not properly reset on subsequent ECN CE
events. Clean up after local transmission failures too.

Reviewed by: tuexen, cc, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43170


# f7d5900a 28-Dec-2023 John Baldwin <jhb@FreeBSD.org>

sys: Style fix for M_EXT | M_EXTPG

Add a space around the | operator in places testing for either M_EXT
or M_EXTPG.

Reviewed by: imp, glebius
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43216


# e3b9058e 18-Dec-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: properly unroll sack transmission on tx error with LRD

Reviewed By: tuexen, #transport
MFC after: 1 week
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43085


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 43b117f8 06-Jun-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: make the maximum number of retransmissions tunable per VNET

Both Windows (TcpMaxDataRetransmissions) and Linux (tcp_retries2)
allow to restrict the maximum number of consecutive timer based
retransmissions. Add that same capability on a per-VNet basis to
FreeBSD.

Reviewed By: cc, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D40424


# 2169f712 11-Apr-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: use IPV6_FLOWLABEL_LEN

Avoid magic numbers when handling the IPv6 flow ID for
DSCP and ECN fields and use the named variable instead.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D39503


# 69c7c811 16-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.

The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831


# 2f201df1 20-Jul-2021 Alfonso <gfunni234@gmail.com>

Change hw_tls to a bool

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/512


# c0e4090e 08-Feb-2023 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Accurately track if ifnet ktls is enabled

This allows us to avoid spurious calls to ktls_disable_ifnet()

When we implemented ifnet kTLSe, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that
now, or in the past, ifnet ktls was active on a socket. Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.

This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection. Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcbcb, we take a ref on the inp when creating a tx
ktls session.

While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.

This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.

Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380


# eaabc937 14-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: retire TCPDEBUG

This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.

We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket. Also the tcp::debug DTrace probes look at
this flag on a socket.

Reviewed by: gnn, tuexen
Discussed with: rscheff, rrs, jtl
Differential revision: https://reviews.freebsd.org/D37694


# e68b3792 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: embed inpcb into tcpcb

For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.

The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.

The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.

One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.

Large part of the change was done with sed script:

s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g

Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.

Differential revision: https://reviews.freebsd.org/D37127


# 9eb0e832 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: provide macros to access inpcb and socket from a tcpcb

There should be no functional changes with this commit.

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37123


# cd84e78f 04-Oct-2022 Randall Stewart <rrs@FreeBSD.org>

tcp idle reduce does not work for a server.

TCP has an idle-reduce feature that allows a connection to reduce its
cwnd after it has been idle more than an RTT. This feature only works
for a sending side connection. It does this by at output checking the
idle time (t_rcvtime vs ticks) to see if its more than the RTO timeout.

The problem comes if you are a web server. You get a request and
then send out all the data.. then go idle. The next time you would
send is in response to a request from the peer asking for more data.
But the thing is you updated t_rcvtime when the request came in so
you never reduce.

The fix is to do the idle reduce check also on inbound.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36721


# 08af8aac 27-Sep-2022 Randall Stewart <rrs@FreeBSD.org>

Tcp progress timeout

Rack has had the ability to timeout connections that just sit idle automatically. This
feature of course is off by default and requires the user set it on (though the socket option
has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in
the base stack as well as rack.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36716


# a743fc88 21-Sep-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: fix cwnd restricted SACK retransmission loop

While doing the initial SACK retransmission segment while heavily cwnd
constrained, tcp_ouput can erroneously send out the entire sendbuffer
again. This may happen after an retransmission timeout, which resets
snd_nxt to snd_una while the SACK scoreboard is still populated.

Reviewed By: tuexen, #transport
PR: 264257
PR: 263445
PR: 260393
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36637


# 6d9e911f 18-Sep-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix computation of offset

Only update the offset if actually retransmitting from the
scoreboard. If not done correctly, this may result in
trying to (re)-transmit data not being being in the socket
buffe and therefore resulting in a panic.

PR: 264257
PR: 263445
PR: 260393
Reviewed by: rscheff@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D36626


# 4012ef77 31-Aug-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Functional implementation of Accurate ECN

The AccECN handshake and TCP header flags are supported,
no support yet for the AccECN option. This minimalistic
implementation is sufficient to support DCTCP while
dramatically cutting the number of ACKs, and provide ECN
response from the receiver to the CC modules.

Reviewed By: #transport, #manpages, rrs, pauamma
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21011


# bd30a121 08-Aug-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve BBLog for output events when using the FreeBSD stack

Put the return value of ip_output()/ip6_output in the output event
instead of adding another one in case of an error. This improves
consistency with other similar places.

Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D36085


# 66605ff7 13-Jul-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Undo the increase in sequence number by 1 due to the FIN flag in case of a transient error.

If an error occurs while processing a TCP segment with some data and the FIN
flag, the back out of the sequence number advance does not take into account the
increase by 1 due to the FIN flag.

Reviewed By: jch, gnn, #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D2970


# 28173d49 02-Jun-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

tcp: Correctly compute the retransmit length for all 64-bit platforms.

When the TCP sequence number subtracted is greater than 2**32 minus
the window size, or 2**31 minus the window size, the use of unsigned
long as an intermediate variable, may result in an incorrect retransmit
length computation on all 64-bit platforms.

While at it create a helper macro to facilitate the computation of
the difference between two TCP sequence numbers.

Differential Revision: https://reviews.freebsd.org/D35388
Reviewed by: rscheff
MFC after: 3 days
Sponsored by: NVIDIA Networking


# 43283184 12-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: use socket buffer mutexes in struct socket directly

Since c67f3b8b78e the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it. In 74a68313b50 macros that access
this mutex directly were added. Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally. However, it allows operation of a
protocol that doesn't use them.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35152


# 732b6d4d 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

netinet: Use __diagused for variables only used in KASSERT().


# 2ff07d92 25-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Restore correct ECT marking behavior on SACK retransmissions

While coalescing all ECN-related code into new common source files,
the flag to deal with SACK retransmissions was skipped. This leads
to non-compliant ECT-marking of SACK retransmissions, as well as
the premature sending of other TCP ECN flags (CWR).

Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34376


# f7220c48 05-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: move ECN handling code to a common file

Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.

Formally no functional change.

Incidentially this establishes correct
ECN operation in one instance.

Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162


# 7994ef3c 04-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

Revert "tcp: move ECN handling code to a common file"

This reverts commit 0c424c90eaa6602e07bca7836b1d178b91f2a88a.


# 0c424c90 04-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: move ECN handling code to a common file

Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.

Formally no functional change.

Incidentially this establishes correct
ECN operation in one instance.

Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162


# f026275e 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: set IP ECN header codepoint properly

TCP RACK can cache the IP header while preparing
a new TCP packet for transmission. Thus all the
IP ECN codepoint bits need to be assigned, without
assuming a clear field beforehand.

Reviewed By: tuexen, kbowling, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34148


# 1ebf4607 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Access all 12 TCP header flags via inline function

In order to consistently provide access to all
(including reserved) TCP header flag bits,
use an accessor function tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.

Reviewed By: hselasky, tuexen, glebius, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34130


# 5b08b46a 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: welcome back tcp_output() as the right way to run output on tcpcb.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33365


# 9e4d9e4c 25-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software.

Hardware TLS is now supported in some interface cards and it works well. Except that
when we have connections that retransmit a lot we get into trouble with all the retransmits.
This prep step makes way for change that Drew will be making so that we can "kick out" a
session from hardware TLS.

Reviewed by: mtuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30895


# 67e89281 10-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Mbuf leak while holding a socket buffer lock.

When running at NF the current Rack and BBR changes with the recent
commits from Richard that cause the socket buffer lock to be held over
the ip_output() call and then finally culminating in a call to tcp_handle_wakeup()
we get a lot of leaked mbufs. I don't think that this leak is actually caused
by holding the lock or what Richard has done, but is exposing some other
bug that has probably been lying dormant for a long time. I will continue to
look (using his changes) at what is going on to try to root cause out the issue.

In the meantime I can't leave the leaks out for everyone else. So this commit
will revert all of Richards changes and move both Rack and BBR back to just
doing the old sorwakeup_locked() calls after messing with the so_rcv buffer.

We may want to look at adding back in Richards changes after I have pinpointed
the root cause of the mbuf leak and fixed it.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30704


# 500eb6dd 21-May-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: Fix sending of TCP segments with IP level options

When bringing in TCP over UDP support in
https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605,
the length of IP level options was considered when locating the
transport header. This was incorrect and is fixed by this patch.

X-MFC with: https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605
MFC after: 3 days
Reviewed by: markj, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D30358


# 0471a8c7 10-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: SACK Lost Retransmission Detection (LRD)

Recover from excessive losses without reverting to a
retransmission timeout (RTO). Disabled by default, enable
with sysctl net.inet.tcp.do_lrd=1

Reviewed By: #transport, rrs, tuexen, #manpages
Sponsored by: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D28931


# 9e644c23 18-Apr-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add support for TCP over UDP

Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469


# 9f2eeb02 08-Apr-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

[tcp] Fix ECN on finalizing sessions.

A subtle oversight would subtly change new data packets
sent after a shutdown() or close() call, while the send
buffer is still draining.

MFC after: 3 days
Reviewed By: #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29616


# e5313869 05-Mar-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Add prr_out in preparation for PRR/nonSACK and LRD

Reviewed By: #transport, kbowling
MFC after: 3 days
Sponsored By: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D29058


# ed782b9f 13-Feb-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve behaviour when using TCP_NOOPT

Use ISS for SEG.SEQ when sending a SYN-ACK segment in response to
an SYN segment received in the SYN-SENT state on a socket having
the IPPROTO_TCP level socket option TCP_NOOPT enabled.

Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D28656


# 4b72ae16 08-Oct-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Stop sending tiny new data segments during SACK recovery

Consider the currently in-use TCP options when
calculating the amount of new data to be injected during
SACK loss recovery. That addresses the effect that very small
(new) segments could be injected on partial ACKs while
still performing a SACK loss recovery.

Reported by: Liang Tian
Reviewed by: tuexen, chengc_netapp.com
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26446


# e3995661 25-Sep-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

TCP: send full initial window when timestamps are in use

The fastpath in tcp_output tries to send out
full segments, and avoid sending partial segments by
comparing against the static t_maxseg variable.
That value does not consider tcp options like timestamps,
while the initial window calculation is using
the correct dynamic tcp_maxseg() function.

Due to this interaction, the last, full size segment
is considered too short and not sent out immediately.

Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26478


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# b9978183 19-Aug-2020 Andrew Gallatin <gallatin@FreeBSD.org>

TCP: remove special treatment for hardware (ifnet) TLS

Remove most special treatment for ifnet TLS in the TCP stack, except
for code to avoid mixing handshakes and bulk data.

This code made heroic efforts to send down entire TLS records to
NICs. It was added to improve the PCIe bus efficiency of older TLS
offload NICs which did not keep state per-session, and so would need
to re-DMA the first part(s) of a TLS record if a TLS record was sent
in multiple TCP packets or TSOs. Newer TLS offload NICs do not need
this feature.

At Netflix, we've run extensive QoE tests which show that this feature
reduces client quality metrics, presumably because the effort to send
TLS records atomically causes the server to both wait too long to send
data (leading to buffers running dry), and to send too much data at
once (leading to packet loss).

Reviewed by: hselasky, jhb, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26103


# 9dc7d8a2 24-Jun-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

TCP: make after-idle work for transactional sessions.

The use of t_rcvtime as proxy for the last transmission
fails for transactional IO, where the client requests
data before the server can respond with a bulk transfer.

Set aside a dedicated variable to actually track the last
locally sent segment going forward.

Reported by: rrs
Reviewed by: rrs, tuexen (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D25016


# f092a3c7 12-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

So it turns out with the right window scaling you can get the code in all stacks to
always want to do a window update, even when no data can be sent. Now in
cases where you are not pacing thats probably ok, you just send an extra
window update or two. However with bbr (and rack if its paced) every time
the pacer goes off its going to send a "window update".

Also in testing bbr I have found that if we are not responding to
data right away we end up staying in startup but incorrectly holding
a pacing gain of 192 (a loss). This is because the idle window code
does not restict itself to only work with PROBE_BW. In all other
states you dont want it doing a PROBE_BW state change.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D25247


# af2fb894 21-May-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

With RFC3168 ECN, CWR SHOULD only be sent with new data

Overly conservative data receivers may ignore the CWR flag
on other packets, and keep ECE latched. This can result in
continous reduction of the congestion window, and very poor
performance when ECN is enabled.

Reviewed by: rgrimes (mentor), rrs
Approved by: rgrimes (mentor), tuexen (mentor)
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23364


# 6e16d877 21-May-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Handle ECN handshake in simultaneous open

While testing simultaneous open TCP with ECN, found that
negotiation fails to arrive at the expected final state.

Reviewed by: tuexen (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23373


# 61664ee7 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 4.2: start divorce of M_EXT and M_EXTPG

They have more differencies than similarities. For now there is lots
of code that would check for M_EXT only and work correctly on M_EXTPG
buffers, so still carry M_EXT bit together with M_EXTPG. However,
prepare some code for explicit check for M_EXTPG.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 6edfd179 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 4.1: mechanically rename M_NOMAP to M_EXTPG

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 7b6c99d0 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 9028b6e0 29-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Prevent premature shrinking of the scaled receive window
which can cause a TCP client to use invalid or stale TCP sequence numbers for ACK packets.

Packets with old sequence numbers are ignored and not used to update the send window size.
This might cause the TCP session to hang indefinitely under some circumstances.

Reported by: Cui Cheng
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24515


# 983066f0 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert route caching to nexthop caching.

This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.

Differential Revision: https://reviews.freebsd.org/D24340


# 23feb563 14-Apr-2020 Andrew Gallatin <gallatin@FreeBSD.org>

KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.

While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213


# ee7a9e50 16-Mar-2020 Andrew Gallatin <gallatin@FreeBSD.org>

Avoid a cache miss accessing an mbuf ext_pgs pointer when doing SW kTLS.

For a Netflix 90Gb/s 100% TLS software kTLS workload, this reduces
the CPI of tcp_m_copym() from ~3.5 to ~2.5 as reported by vtune.

Reviewed by: jtl, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D23998


# a3574665 13-Feb-2020 Michael Tuexen <tuexen@FreeBSD.org>

sack_newdata and snd_recover hold the same value. Therefore, use only
a single instance: use snd_recover also where sack_newdata was used.

Submitted by: Richard Scheffenegger
Differential Revision: https://reviews.freebsd.org/D18811


# 481be5de 12-Feb-2020 Randall Stewart <rrs@FreeBSD.org>

White space cleanup -- remove trailing tab's or spaces
from any line.

Sponsored by: Netflix Inc.


# 47e2c17c 25-Jan-2020 Michael Tuexen <tuexen@FreeBSD.org>

Don't set the ECT codepoint on retransmitted packets during SACK loss
recovery. This is required by RFC 3168.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, tuexen@, Cheng Cui
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23118


# 109eb549 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make tcp_output() require network epoch.

Enter the epoch before calling into tcp_output() from those
functions, that didn't do that before.

This eliminates a bunch of epoch recursions in TCP.


# b9555453 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make ip6_output() and ip_output() require network epoch.

All callers that before may called into these functions
without network epoch now must enter it.


# adc56f5a 02-Dec-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make use of the stats(3) framework in the TCP stack.

This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time. stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by: thj
Tested by: thj
Obtained from: Netflix
Relnotes: yes
Sponsored by: Klara Inc, Netflix
Differential Revision: https://reviews.freebsd.org/D20655


# 3cf38784 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

Move all ECN related flags from the flags to the flags2 field.
This allows adding more ECN related flags in the future.
No functional change intended.

Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22497


# 12a43d0d 29-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

RFC 7112 requires a host to put the complete IP header chain
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.

Reviewed by: jtl@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21665


# 82e837f8 06-Sep-2019 Warner Losh <imp@FreeBSD.org>

Initialize if_hw_tsomaxsegsize to 0 to appease gcc's flow analysis as a
fail-safe.


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# e5926fd3 14-Jul-2019 Randall Stewart <rrs@FreeBSD.org>

This is the second in a number of patches needed to
get BBRv1 into the tree. This fixes the DSACK bug but
is also needed by BBR. We have yet to go two more
one will be for the pacing code (tcp_ratelimit.c) and
the second will be for the new updated LRO code that
allows a transport to know the arrival times of packets
and (tcp_lro.c). After that we should finally be able
to get BBRv1 into head.

Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D20908


# fd29ff5d 03-Apr-2019 Randall Stewart <rrs@FreeBSD.org>

Undo my previous erroneous commit changing the tcp_output kassert.
Hmm now the question is where did the tcp_log_id change go :o


# 7854c63d 26-Mar-2019 Randall Stewart <rrs@FreeBSD.org>

Fix a small bug in the tcp_log_id where the bucket
was unlocked and yet the bucket-unlock flag was not
changed to false. This can cause a panic if INVARIANTS
is on and we go through the right path (though rare).

Reported by: syzbot+179a1ad49f3c4c215fa2@syzkaller.appspotmail.com
Reviewed by: tuexen@
MFC after: 1 week


# 05fb056c 23-Mar-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix a KASSERT() in tcp_output().

When checking the length of the headers at this point, the IP level
options have not been added to the mbuf chain.
So don't take them into account.

Reported by: syzbot+16025fff7ee5f7c5957b@syzkaller.appspotmail.com
Reported by: syzbot+adb5836b8a9ff621b2aa@syzkaller.appspotmail.com
Reported by: syzbot+d25a5352bcdf40acdbb8@syzkaller.appspotmail.com
Reviewed by: rrs@
MFC after: 3 days
Sponsored by: Netflix, Inc.


# 50075939 15-Jan-2019 Stephen Hurd <shurd@FreeBSD.org>

Fix window update issue when scaling disabled

When the TCP window scale option is not used, and the window
opens up enough in one soreceive, a window update will not be sent.

For example, if recwin == 65535, so->so_rcv.sb_hiwat >= 262144, and
so->so_rcv.sb_hiwat <= 524272, the window update will never be sent.
This is because recwin and adv are clamped to TCP_MAXWIN << tp->rcv_scale,
and so will never be >= so->so_rcv.sb_hiwat / 4
or <= so->so_rcv.sb_hiwat / 8.

This patch ensures a window update is sent if the window opens by
TCP_MAXWIN << tp->rcv_scale, which should only happen when the window
size goes from zero to the max expressible.

This issue looks like it was introduced in r306769 when recwin was clamped
to TCP_MAXWIN << tp->rcv_scale.

MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D18821


# 79410718 22-Nov-2018 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that TCP RST-segments announce consistently a receiver window of
zero. This was already done when sending them via tcp_respond().

Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D17949


# 74e10fb6 22-Oct-2018 John Baldwin <jhb@FreeBSD.org>

A couple of style fixes in recent TCP changes.

- Add a blank line before a block comment to match other block comments
in the same function.
- Sort the prototype for sbsndptr_adv and fix whitespace between return
type and function name.

Reviewed by: gallatin, bz
Differential Revision: https://reviews.freebsd.org/D17474


# 8db239dc 30-Jul-2018 Michael Tuexen <tuexen@FreeBSD.org>

Fix some TCP fast open issues.

The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled, calls accept(),
recv(), send(), and close() before the TCP-ACK segment has been received,
the TCP connection is just dropped and the reception of the TCP-ACK
segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled, calls accept(), recv(),
send(), send(), and close() before the TCP-ACK segment has been received,
the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
by close() the TCP connection is just dropped.

Reviewed by: jtl@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16485


# a00f4ac2 23-Jun-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Revert r334843, and partially revert r335180.

tcp_outflags[] were defined since 4BSD and are defined nowadays in
all its descendants. Removing them breaks third party application.


# 581a046a 21-Jun-2018 Randall Stewart <rrs@FreeBSD.org>

This adds in an optimization so that we only walk one
time through the mbuf chain during copy and TSO limiting.
It is used by both Rack and now the FreeBSD stack.
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D15937


# 9293873e 14-Jun-2018 Gleb Smirnoff <glebius@FreeBSD.org>

TCPOUTFLAGS no longer exists since r334843.


# 3db28e66 08-Jun-2018 Matt Macy <mmacy@FreeBSD.org>

avoid 'tcp_outflags defined but not used'


# 89e560f4 07-Jun-2018 Randall Stewart <rrs@FreeBSD.org>

This commit brings in a new refactored TCP stack called Rack.
Rack includes the following features:
- A different SACK processing scheme (the old sack structures are not used).
- RACK (Recent acknowledgment) where counting dup-acks is no longer done
instead time is used to knwo when to retransmit. (see the I-D)
- TLP (Tail Loss Probe) where we will probe for tail-losses to attempt
to try not to take a retransmit time-out. (see the I-D)
- Burst mitigation using TCPHTPS
- PRR (partial rate reduction) see the RFC.

Once built into your kernel, you can select this stack by either
socket option with the name of the stack is "rack" or by setting
the global sysctl so the default is rack.

Note that any connection that does not support SACK will be kicked
back to the "default" base FreeBSD stack (currently known as "default").

To build this into your kernel you will need to enable in your
kernel:
makeoptions WITH_EXTRA_TCP_STACKS=1
options TCPHPTS

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15525


# 10d20c84 07-May-2018 Matt Macy <mmacy@FreeBSD.org>

Fix spurious retransmit recovery on low latency networks

TCP's smoothed RTT (SRTT) can be much larger than an actual observed RTT. This can be either because of hz restricting the calculable RTT to 10ms in VMs or 1ms using the default 1000hz or simply because SRTT recently incorporated a larger value.

If an ACK arrives before the calculated badrxtwin (now + SRTT):
tp->t_badrxtwin = ticks + (tp->t_srtt >> (TCP_RTT_SHIFT + 1));

We'll erroneously reset snd_una to snd_max. If multiple segments were dropped and this happens repeatedly the transmit rate will be limited to 1MSS per RTO until we've retransmitted all drops.

Reported by: rstone
Reviewed by: hiren, transport
Approved by: sbruno
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8556


# 5ada5423 04-May-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Immediately propagate EACCES error code to application from tcp_output.

In r309610 and r315514 the behavior of handling EACCES was changed, and
tcp_output() now returns zero when EACCES happens. The reason of this
change was a hesitation that applications that use TCP-MD5 will be
affected by changes in project/ipsec.

TCP-MD5 code returns EACCES when security assocition for given connection
is not configured. But the same error code can return pfil(9), and this
change has affected connections blocked by pfil(9). E.g. application
doesn't return immediately when SYN segment is blocked, instead it waits
when several tries will be failed.

Actually, for TCP-MD5 application it doesn't matter will it get EACCES
after first SYN, or after several tries. Security associtions must be
configured before initiating TCP connection.

I left the EACCES in the switch() to show that it has special handling.

Reported by: Andreas Longwitz <longwitz at incore dot de>
MFC after: 10 days


# dcaffbd6 10-Apr-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Move the TCP Blackbox Recorder probe in tcp_output.c to be with the
other tracing/debugging code.

Sponsored by: Netflix, Inc.


# 2529f56e 22-Mar-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Add the "TCP Blackbox Recorder" which we discussed at the developer
summits at BSDCan and BSDCam in 2017.

The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.

It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.

You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.

This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.

There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.

Reviewed by: gnn (previous version)
Obtained from: Netflix, Inc.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11085


# 18a75309 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.

The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by: hiren
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14048


# c560df6f 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

This is an implementation of the client side of TCP Fast Open (TFO)
[RFC7413]. It also includes a pre-shared key mode of operation in
which the server requires the client to be in possession of a shared
secret in order to successfully open TFO connections with that server.

The names of some existing fastopen sysctls have changed (e.g.,
net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable).

Reviewed by: tuexen
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14047


# 151ba793 24-Dec-2017 Alexander Kabaev <kan@FreeBSD.org>

Do pass removing some write-only variables from the kernel.

This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385


# 2aad6240 13-Dec-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Fix mbuf leak when TCPMD5_OUTPUT() method returns error.

PR: 223817
MFC after: 1 week


# 66492fea 07-Dec-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Separate out send buffer autoscaling code into function, so that
alternative TCP stacks may reuse it instead of pasting.

Obtained from: Netflix


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 3e21cbc8 13-Nov-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Style r320614: don't initialize at declaration, new line after declarations,
shorten variable name to avoid extra long lines.
No functional changes.


# 3bdf4c42 11-Oct-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Declare more TCP globals in tcp_var.h, so that alternative TCP stacks
can use them. Gather all TCP tunables in tcp_var.h in one place and
alphabetically sort them, to ease maintainance of the list.

Don't copy and paste declarations in tcp_stacks/fastpath.c.


# cb503ae2 05-Jul-2017 Jonathan T. Looney <jtl@FreeBSD.org>

Don't overpromote values when calculating len in tcp_output().

sbavail() returns u_int and sendwin is a uint32_t. Therefore, min() (which
operates on two u_int values) is able to correctly calculate the minimum
of these two arguments.

Reported by: rrs
MFC after: 1 week
Sponsored by: Netflix


# ac952dd2 03-Jul-2017 Sean Bruno <sbruno@FreeBSD.org>

Add a sysctl to toggle the use of the sockets LOWAT when calculating auto window growth

Submitted by: j@nitrology.com (Jason Wolfe)
Reviewed by: gnn hiren
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11016


# e44c1887 10-Apr-2017 Steven Hartland <smh@FreeBSD.org>

Use estimated RTT for receive buffer auto resizing instead of timestamps

Switched from using timestamps to RTT estimates when performing TCP receive
buffer auto resizing, as not all hosts support / enable TCP timestamps.

Disabled reset of receive buffer auto scaling when not in bulk receive mode,
which gives an extra 20% performance increase.

Also extracted auto resizing to a common method shared between standard and
fastpath modules.

With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from
~3MB/s to ~100MB/s using the default settings.

Reviewed by: lstewart, gnn
MFC after: 2 weeks
Relnotes: Yes
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D9668


# 4a5c6c6a 27-Mar-2017 Mike Karels <karels@FreeBSD.org>

Enable route and LLE (ndp) caching in TCP/IPv6

tcp_output.c was using a route on the stack for IPv6, which does not
allow route caching or LLE/ndp caching. Switch to using the route
(v6 flavor) in the in_pcb, which was already present, which caches
both L3 and L2 lookups.

Reviewed by: gnn hiren
MFC after: 2 weeks


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 8d62aae8 23-Feb-2017 Michael Tuexen <tuexen@FreeBSD.org>

TCP window updates are only sent if the window can be increased by at
least 2 * MSS. However, if the receive buffer size is small, this might
be impossible. Add back a criterion to send a TCP window update if
the window can be increased by at least half of the receive buffer size.
This condition was removed in r242252. This patch simply brings it back.
PR: 211003
Reviewed by: gnn
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D9475


# fcf59617 06-Feb-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Merge projects/ipsec into head/.

Small summary
-------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352


# 3df96ee6 29-Jan-2017 Cy Schubert <cy@FreeBSD.org>

Correct comment grammar and make it easier to understand.

MFC after: 1 week


# 2b9c9984 03-Jan-2017 George V. Neville-Neil <gnn@FreeBSD.org>

Fix DTrace TCP tracepoints to not use mtod() as it is both unnecessary and
dangerous. Those wanting data from an mbuf should use DTrace itself to get
the data.

PR: 203409
Reviewed by: hiren
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D9035


# b6ff6724 11-Dec-2016 Hiren Panchasara <hiren@FreeBSD.org>

We currently don't do TSO if ip options are present. In case of IPv6, we look at
in6p_options to check that. That is incorrect as we carry ip options in
in6p_outputopts. Also, just checking for in6p_outputopts being NULL won't
suffice as we combine ip options and ip header fields both in that one field.
The commit fixes this by using ip6_optlen() which correctly calculates length
of only ip options for IPv6.

Reviewed by: ae, bz
MFC after: 3 weeks
Sponsored by: Limelight Networks


# 68bd7ed1 12-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

The TFO server-side code contains some changes that are not conditioned on
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.

While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.

Reviewed by: hiren
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8219


# bd79708d 11-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185


# 3ac12506 06-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by: gnn, lstewart (partial)
Sponsored by: Juniper Networks, Netflix
Differential Revision: (multiple)
Tested by: Limelight, Netflix


# 0dda76b8 06-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

If the new window size is less than the old window size, skip the
calculations to check if we should advertise a larger window.

Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: Juniper Networks, Netflix
Differential Revision: https://reviews.freebsd.org/D7076
Tested by: Limelight, Netflix


# 15c82571 06-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Correctly calculate snd_max in persist case.

In the persist case, take the SYN and FIN flags into account when updating
the sequence space sent.

Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: Juniper Networks, Netflix
Differential Revision: https://reviews.freebsd.org/D7075
Tested by: Limelight, Netflix


# c3bef61e 15-Sep-2016 Kevin Lo <kevlo@FreeBSD.org>

Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.

Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D7878


# 425b7639 29-May-2016 Sepherosa Ziehau <sephe@FreeBSD.org>

tcp: Don't prematurely drop receiving-only connections

If the connection was persistent and receiving-only, several (12)
sporadic device insufficient buffers would cause the connection be
dropped prematurely:
Upon ENOBUFS in tcp_output() for an ACK, retransmission timer is
started. No one will stop this retransmission timer for receiving-
only connection, so the retransmission timer promises to expire and
t_rxtshift is promised to be increased. And t_rxtshift will not be
reset to 0, since no RTT measurement will be done for receiving-only
connection. If this receiving-only connection lived long enough
(e.g. >350sec, given the RTO starts from 200ms), and it suffered 12
sporadic device insufficient buffers, i.e. t_rxtshift >= 12, this
receiving-only connection would be dropped prematurely by the
retransmission timer.

We now assert that for data segments, SYNs or FINs either rexmit or
persist timer was wired upon ENOBUFS. And don't set rexmit timer
for other cases, i.e. ENOBUFS upon ACKs.

Discussed with: lstewart, hiren, jtl, Mike Karels
MFC after: 3 weeks
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5872


# 883054b4 19-May-2016 Don Lewis <truckman@FreeBSD.org>

Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on
control to a three way setting.
0 - Totally disable ECN. (no change)
1 - Enable ECN if incoming connections request it. Outgoing
connections will request ECN. (no change from present != 0 setting)
2 - Enable ECN if incoming connections request it. Outgoing
conections will not request ECN.

Change the default value of net.inet.tcp.ecn.enable from 0 to 2.

Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have
similar capabilities. The actual values above match Linux, and the default
matches the current Linux default.

Reviewed by: eadler
MFC after: 1 month
MFH: yes
Sponsored by: https://reviews.freebsd.org/D6386


# a4641f4e 03-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/net*: minor spelling fixes.

No functional change.


# 84cc0778 24-Mar-2016 George V. Neville-Neil <gnn@FreeBSD.org>

FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306


# 5d20f974 22-Mar-2016 Jonathan T. Looney <jtl@FreeBSD.org>

to_flags is currently a 64-bit integer; however, we only use 7 bits.
Furthermore, there is no reason this needs to be a 64-bit integer
for the forseeable future.

Also, there is an inconsistency between to_flags and the mask in
tcp_addoptions(). Before r195654, to_flags was a u_long and the mask in
tcp_addoptions() was a u_int. r195654 changed to_flags to be a u_int64_t
but left the mask in tcp_addoptions() as a u_int, meaning that these
variables will only be the same width on platforms with 64-bit integers.

Convert both to_flags and the mask in tcp_addoptions() to be explicitly
32-bit variables. This may save a few cycles on 32-bit platforms, and
avoids unnecessarily mixing types.

Differential Revision: https://reviews.freebsd.org/D5584
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks


# e79cb051 03-Mar-2016 George V. Neville-Neil <gnn@FreeBSD.org>

Fix dtrace probes (introduced in 287759): debug__input was used
for output and drop; connect didn't always fire a user probe
some probes were missing in fastpath

Submitted by: Hannes Mehnert
Sponsored by: REMS, EPSRC
Differential Revision: https://reviews.freebsd.org/D5525


# 4644fda3 27-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Rename netinet/tcp_cc.h to netinet/cc/cc.h.

Discussed with: lstewart


# 0645c604 26-Jan-2016 Hiren Panchasara <hiren@FreeBSD.org>

Persist timers TCPTV_PERSMIN and TCPTV_PERSMAX are hardcoded with 5 seconds and
60 seconds, respectively. Turn them into sysctls that can be tuned live. The
default values of 5 seconds and 60 seconds have been retained.

Submitted by: Jason Wolfe (j at nitrology dot com)
Reviewed by: gnn, rrs, hiren, bz
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D5024


# 2de3e790 21-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

- Rename cc.h to more meaningful tcp_cc.h.
- Declare it a kernel only include, which it already is.
- Don't include tcp.h implicitly from tcp_cc.h


# f73d9fd2 14-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

There is a bug in tcp_output()'s implementation of the TCP_SIGNATURE
(RFC 2385/TCP-MD5) kernel option.

If a tcpcb has TF_NOOPT flag, then tcp_addoptions() is not called,
and to.to_signature is an uninitialized stack variable. The value
is later used as write offset, which leads to writing to random
address.

Submitted by: rstone, jtl
Security: SA-16:05.tcp


# 0c39d38d 06-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Historically we have two fields in tcpcb to describe sender MSS: t_maxopd,
and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned
up after T/TCP removal. After all permutations over the years the result is
that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum
protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if
timestamps are in action, or is equal to t_maxopd otherwise. That's a very
rough estimate of MSS reduced by options length. Throughout the code it
was used in places, where preciseness was not important, like cwnd or
ssthresh calculations.

With this change:

- t_maxopd goes away.
- t_maxseg now stores MSS not adjusted by options.
- new function tcp_maxseg() is provided, that calculates MSS reduced by
options length. The functions gives a better estimate, since it takes
into account SACK state as well.

Reviewed by: jtl
Differential Revision: https://reviews.freebsd.org/D3593


# 281a0fd4 24-Dec-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Implementation of server-side TCP Fast Open (TFO) [RFC7413].

TFO is disabled by default in the kernel build. See the top comment
in sys/netinet/tcp_fastopen.c for implementation particulars.

Reviewed by: gnn, jch, stas
MFC after: 3 days
Sponsored by: Verisign, Inc.
Differential Revision: https://reviews.freebsd.org/D4350


# 86a996e6 13-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren


# d76d4012 14-Sep-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

Update TSO limits to include all headers.

To make driver programming easier the TSO limits are changed to
reflect the values used in the BUSDMA tag a network adapter driver is
using. The TCP/IP network stack will subtract space for all linklevel
and protocol level headers and ensure that the full mbuf chain passed
to the network adapter fits within the given limits.

Implementation notes:

If a network adapter driver needs to fixup the first mbuf in order to
support VLAN tag insertion, the size of the VLAN tag should be
subtracted from the TSO limit. Else not.

Network adapters which typically inline the complete header mbuf could
technically transmit one more segment. This patch does not implement a
mechanism to recover the last segment for data transmission. It is
believed when sufficiently large mbuf clusters are used, the segment
limit will not be reached and recovering the last segment will not
have any effect.

The current TSO algorithm tries to send MTU-sized packets, where the
MTU typically is 1500 bytes, which gives 1448 bytes of TCP data
payload per packet for IPv4. That means if the TSO length limitiation
is set to 65536 bytes, there will be a data payload remainder of
(65536 - 1500) mod 1448 bytes which is equal to 324 bytes. Trying to
recover total TSO length due to inlining mbuf header data will not
have any effect, because adding or removing the ETH/IP/TCP headers
to or from 324 bytes will not cause more or less TCP payload to be
TSO'ed.

Existing network adapter limits will be updated separately.

Differential Revision: https://reviews.freebsd.org/D3458
Reviewed by: rmacklem
MFC after: 2 weeks


# 5d06879a 13-Sep-2015 George V. Neville-Neil <gnn@FreeBSD.org>

dd DTrace probe points, translators and a corresponding script
to provide the TCPDEBUG functionality with pure DTrace.

Reviewed by: rwatson
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: D3530


# fc4443a1 25-Jul-2015 Kristof Provost <kp@FreeBSD.org>

Remove stale comment.

The IPv6 pseudo header checksum was added by bz in r235961.

Sponsored by: Essen FreeBSD Hackathon


# 47a8e865 21-Jul-2015 Xin LI <delphij@FreeBSD.org>

Fix resource exhaustion due to sessions stuck in LAST_ACK state.

Submitted by: Jonathan Looney (Juniper SIRT)
Reviewed by: lstewart
Security: CVE-2015-5358
Security: SA-15:13.tcp


# f8568079 29-Jun-2015 Hiren Panchasara <hiren@FreeBSD.org>

Avoid a situation where we do not set persist timer after a zero window
condition.
If you send a 0-length packet, but there is data is the socket buffer, and
neither the rexmt or persist timer is already set, then activate the persist
timer.

PR: 192599
Differential Revision: D2946
Submitted by: jlott at averesystems dot com
Reviewed by: jhb, jch, gnn, hiren
Tested by: jlott at averesystems dot com, jch
MFC after: 2 weeks


# ed6a66ca 05-Jan-2015 Robert Watson <rwatson@FreeBSD.org>

To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:

- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().

This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.

Differential Revision: https://reviews.freebsd.org/D1436
Reviewed by: glebius, trasz, gnn, bz
Sponsored by: EMC / Isilon Storage Division


# cfa6009e 12-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 3c7c188c 10-Nov-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix some minor TSO issues:
- Improve description of TSO limits.
- Remove a not needed KASSERT()
- Remove some not needed variable casts.

Sponsored by: Mellanox Technologies
Discussed with: lstewart @
MFC after: 1 week


# 6df8a710 07-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.

Sponsored by: Nginx, Inc.


# 257480b8 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert netinet6/ to use new routing API.

* Remove &ifpp from ip6_output() in favor of ri->ri_nh_info
* Provide different wrappers to in6_selectsrc:
Currently it is used by 2 differenct type of customers:
- socket-based one, which all are unsure about provided
address scope and
- in-kernel ones (ND code mostly), which don't have
any sockets, options, crededentials, etc.
So, we provide two different wrappers to in6_selectsrc()
returning select source.
* Make different versions of selectroute():
Currenly selectroute() is used in two scenarios:
- SAS, via in6_selecsrc() -> in6_selectif() -> selectroute()
- output, via in6_output -> wrapper -> selectroute()
Provide different versions for each customer:
- fib6_lookup_nh_basic()-based in6_selectif() which is
capable of returning interface only, without MTU/NHOP/L2
calculations
- full-blown fib6_selectroute() with cached route/multipath/
MTU/L2
* Stop using routing table for link-local address lookups
* Add in6_ifawithifp_lla() to make for-us check faster for link-local
* Add in6_splitscope / in6_setllascope for faster embed/deembed scopes


# b4e8f808 19-Oct-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch IPv4 output path to use new routing api.

The goals of the new API is to provide consumers with minimal
needed information, but as fast as possible. So we provide
full nexthop info copied into alighed on-cache structure
instead of rte/ia pointers, their refcounts and locks.
This does not provide solution for protecting from egress
ifp destruction, but does not make it any worse.

Current changes:

nhops:
Add fib4_lookup_prepend() function which stores either full
L2+L3 prepend info (e.g. MAC header in case of plain IPv4) or
L3 info with NH_FLAGS_L2_INCOMPLETE flag indicating that no valid L2
info exists and we have to take "slow" path.

ip_output:
Currently ip[ 46]_output consumers use 'struct route' for
the following purposes:
1) double lookup avoidance(route caching)
2) plain route caching
3) get path MTU to be able to notify source.
The former pattern is mostly used by various tunnels
(gif, gre, stf). (Actually, gre is the only remaining,
others were already converted. Their locking model did
not scale good enogh to benefit from such caching, so
we have (temporarily) removed it without any performance
loss).
Plain route caching used by SCTP is simply wrong and should be removed.
Temporary break it for now just to be able to compile.
Optimize path mtu reporting by providing it in new 'route_info' stucture.

Minimize games with @ia locking/refcounting for route lookup:
add special nhop[46]_extended structure to store more route attributes.
Pointer to given structure can be passed to fib4_lookup_prepend() to indicate
we want this info (we actually needs it for UDP and raw IP).

ether_output:
Provide light-weight ether_output2() call to deal with
transmitting L2 frame (e.g. properly handle broadcast/simloop/bridge/
other L2 hooks before actually transmitting frame by if_transmit()).
Add a hack based on new RT_NHOP ro_flag to distinguish which version should
we call. Better way is probably to add a new "if_output_frame" driver
callbacks.

Next steps:
* Convert ip_fastfwd part
* Implement auto-growing array for per-radix nexthops
* Implement LLE tracking for nexthop calculations to be able to
immediately provide all necessary info in single route lookup
for gateway routes
* Switch radix locking scheme to runtime/cfg lock
* Implement multipath support for rtsock
* Implement "tracked nexthops" for tunnels (e.g. _proper_
nexthop caching)
* Add IPv6 support for remaining parts (postponed not to
interfere with user/ae/inet6 branch)
* Consider adding "if_output_frame" driver call to
ease logical frame pushing.


# 0f3e3bc5 13-Oct-2014 Sean Bruno <sbruno@FreeBSD.org>

Catch ipv6 case when attempting to do PLPMTUD blackhole detection.

Submitted by: Mikhail <mp@lenta.ru>
MFC after: 2 weeks
Relnotes: yes


# f6f6703f 07-Oct-2014 Sean Bruno <sbruno@FreeBSD.org>

Implement PLPMTUD blackhole detection (RFC 4821), inspired by code
from xnu sources. If we encounter a network where ICMP is blocked
the Needs Frag indicator may not propagate back to us. Attempt to
downshift the mss once to a preconfigured value.

Default this feature to off for now while we do not have a full PLPMTUD
implementation in our stack.

Adds the following new sysctl's for control:
net.inet.tcp.pmtud_blackhole_detection -- turns on/off this feature
net.inet.tcp.pmtud_blackhole_mss -- mss to try for ipv4
net.inet.tcp.v6pmtud_blackhole_mss -- mss to try for ipv6

Adds the following new sysctl's for monitoring:
-- Number of times the code was activated to attempt a mss downshift
net.inet.tcp.pmtud_blackhole_activated
-- Number of times the blackhole mss was used in an attempt to downshift
net.inet.tcp.pmtud_blackhole_min_activated
-- Number of times that we failed to connect after we downshifted the mss
net.inet.tcp.pmtud_blackhole_failed

Phabricator: https://reviews.freebsd.org/D506
Reviewed by: rpaulo bz
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Limelight Networks


# b228e6bf 06-Oct-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Minor code styling.

Suggested by: glebius @


# 9fd573c3 22-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Improve transmit sending offload, TSO, algorithm in general.

The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week


# 72f31000 13-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Revert r271504. A new patch to solve this issue will be made.

Suggested by: adrian @


# eb93b77a 13-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Improve transmit sending offload, TSO, algorithm in general.

The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

MFC after: 1 week
Sponsored by: Mellanox Technologies


# 43630e62 03-Jul-2014 Hiren Panchasara <hiren@FreeBSD.org>

Fix a typo.


# e3a7aa6f 04-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# d9fae5ab 26-Nov-2013 Andriy Gapon <avg@FreeBSD.org>

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 57f60867 25-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


# 3c914c54 02-Jun-2013 Andre Oppermann <andre@FreeBSD.org>

Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore. The upper limit is still at
IP_MAXPACKET (65536 bytes). Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found. When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by: cperciva (earlier version)
Reviewed by: cperciva
Tested by: cperciva
MFC after: 1 week (using spare struct members to preserve ABI)


# 0e2bc05c 11-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Fix tcp_output() so that tcpcb is updated in the same manner when an
mbuf allocation fails, as in a case when ip_output() returns error.

To achieve that, move large block of code that updates tcpcb below
the out: label.

This fixes a panic, that requires the following sequence to happen:

1) The SYN was sent to the network, tp->snd_nxt = iss + 1, tp->snd_una = iss
2) The retransmit timeout happened for the SYN we had sent,
tcp_timer_rexmt() sets tp->snd_nxt = tp->snd_una, and calls tcp_output().
In tcp_output m_get() fails.
3) Later on the SYN|ACK for the SYN sent in step 1) came,
tcp_input sets tp->snd_una += 1, which leads to
tp->snd_una > tp->snd_nxt inconsistency, that later panics in
socket buffer code.

For reference, this bug fixed in DragonflyBSD repo:

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1ff9b7d322dc5a26f7173aa8c38ecb79da80e419

Reviewed by: andre
Tested by: pho
Sponsored by: Nginx, Inc.
PR: kern/177456
Submitted by: HouYeFei&XiBoLiu <lglion718 163.com>


# aa8bd99d 16-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Replace compat macros with function calls.


# 39f6074e 14-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Use m_getcl() instead of hand allocating.

Sponsored by: Nginx, Inc.


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 0f5e7edc 07-Nov-2012 Kevin Lo <kevlo@FreeBSD.org>

Fix typo; s/ouput/output


# 78f59b4b 29-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Forced commit to provide the correct commit message to r242251:

Defer sending an independent window update if a delayed ACK is pending
saving a packet. The window update then gets piggy-backed on the next
already scheduled ACK.

Added grammar fixes as well.

MFC after: 2 weeks


# f62563d3 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Prevent a flurry of forced window updates when an application is
doing small reads on a (partially) filled receive socket buffer.

Normally one would a send a window update every time the available
space in the socket buffer increases by two times MSS. This leads
to a flurry of window updates that do not provide any meaningful
new information to the sender. There still is available space in
the window and the sender can continue sending data. All window
updates then get carried by the regular ACKs. Only when the socket
buffer was (almost) full and the window closed accordingly a window
updates delivery new information and allows the sender to start
sending more data again.

Send window updates only every two MSS when the socket buffer
has less than 1/8 space available, or the available space in the
socket buffer increased by 1/4 its full capacity, or the socket
buffer is very small. The next regular data ACK will carry and
report the exact window size again.

Reported by: sbruno
Tested by: darrenr
Tested by: Darren Baginski
PR: kern/116335
MFC after: 2 weeks


# 4249614c 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment. This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after: 2 weeks


# 8f134647 22-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>


# df0633a1 16-Jul-2012 Gleb Smirnoff <glebius@FreeBSD.org>

If ip_output() returns EMSGSIZE to tcp_output(), then the latter calls
tcp_mtudisc(), which in its turn may call tcp_output(). Under certain
conditions (must admit they are very special) an infinite recursion can
happen.

To avoid recursion we can pass struct route to ip_output() and obtain
correct mtu. This allows us not to use tcp_mtudisc() but call tcp_mss_update()
directly.

PR: kern/155585
Submitted by: Andrey Zonov <andrey zonov.org> (original version of patch)


# 09fe6320 19-Jun-2012 Navdeep Parhar <np@FreeBSD.org>

- Updated TOE support in the kernel.

- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)


# 356ab07e 28-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958


# 45747ba5 24-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 bz_ipv6_fast:

Add code to handle pre-checked TCP checksums as indicated by mbuf
flags to save the entire computation for validation if not needed.

In the IPv6 TCP output path only compute the pseudo-header checksum,
set the checksum offset in the mbuf field along the appropriate flag
as done in IPv4.

In tcp_respond() just initialize the IPv6 payload length to 0 as
ip6_output() will properly set it.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# ef341ee1 16-Apr-2012 Gleb Smirnoff <glebius@FreeBSD.org>

When we receive an ICMP unreach need fragmentation datagram, we take
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.

This sequence gets broken if the TCP host cache is exausted. In this
case allocation fails, and later called tcp_mss_update() finds nothing
in cache. The fast retransmit is done with not reduced MSS and is
immidiately replied by remote host with new ICMP datagrams and the
cycle repeats. This ping-pong can go up to wirespeed.

To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
MTU value, that is passed to tcp_mss_update() as mtuoffer.

Reported by: az
Reported by: Andrey Zonov <andrey zonov.org>
Reviewed by: andre (previous version of patch)


# d8951c8a 15-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when
hz >> 1000 and thus getting outside the timestamp clock frequenceny of
1ms < x < 1s per tick as mandated by RFC1323, leading to connection
resets on idle connections.

Always use a granularity of 1ms using getmicrouptime() making all but
relevant callouts independent of hz.

Use getmicrouptime(), not getmicrotime() as the latter may make a jump
possibly breaking TCP nfsroot mounts having our timestamps move forward
for more than 24.8 days in a second without having been idle for that
long.

PR: kern/61404
Reviewed by: jhb, mav, rrs
Discussed with: silby, lstewart
Sponsored by: Sandvine Incorporated (originally in 2011)
MFC after: 6 weeks


# ddd0c4a9 02-Nov-2011 Sergey Kandaurov <pluknet@FreeBSD.org>

Restore sysctl names for tcp_sendspace/tcp_recvspace.

They seem to be changed unintentionally in r226437, and there were no
any mentions of renaming in commit log message.

Reported by: Anton Yuzhaninov <citrin citrin ru>


# 873789cb 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

Move the tcp_sendspace and tcp_recvspace sysctl's from
the middle of tcp_usrreq.c to the top of tcp_output.c
and tcp_input.c respectively next to the socket buffer
autosizing controls.

MFC after: 1 week


# 9ec4a4cc 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

Remove the ss_fltsz and ss_fltsz_local sysctl's which have
long been superseded by the RFC3390 initial CWND sizing.

Also remove the remnants of TCP_METRICS_CWND which used the
TCP hostcache to set the initial CWND in a non-RFC compliant
way.

MFC after: 1 week


# b233773b 25-Aug-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Increase the defaults for the maximum socket buffer limit,
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.

For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.

Note that this is just the defaults. They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption. All values are still tunable
by sysctls.

Suggested by: gnn
Discussed on: arch (Mar and Aug 2011)
MFC after: 3 weeks
Approved by: re (kib)


# 472ea5be 05-Jul-2011 Colin Percival <cperciva@FreeBSD.org>

Remove #ifdef notyet code dating back to 4.3BSD Net/2 (and possibly earlier).

I think the benefit of making the code cleaner and easier to understand
outweighs the humour of leaving this intact (or possibly changing it to
#ifdef not_yet_and_probably_never).

MFC after: 2 weeks


# 75497cc5 20-Jun-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix a KASSERT from r212803 to check the correct length also in case of
IPsec being compiled in and used. Improve reporting by adding the length
fields to the panic message, so that we would have some immediate debugging
hints.

Discussed with: jhb


# 6b7c15e5 13-Jun-2011 John Baldwin <jhb@FreeBSD.org>

Advance the advertised window (rcv_adv) to the currently received data
(rcv_nxt) if we advertising a zero window. This can be true when ACK'ing
a window probe whose one byte payload was accepted rather than dropped
because the socket's receive buffer was not completely full, but the
remaining space was smaller than the window scale.

This ensures that window probe ACKs satisfy the assumption made in r221346
and closes a window where rcv_nxt could be greater than rcv_adv.

Tested by: trasz, pho, trociny
Reviewed by: silby
MFC after: 1 week


# f701e30d 02-May-2011 John Baldwin <jhb@FreeBSD.org>

Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowing draining data from the receive buffer), then the single byte
of data in the window probe is accepted. However, this can cause rcv_nxt
to be greater than rcv_adv. This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked.

During the window while rcv_nxt is greather than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1. On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long. In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.

Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greather than rcv_adv.

Reviewed by: bz
MFC after: 1 month


# b287c6c7 30-Apr-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness. To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


# 672dc4ae 29-Apr-2011 John Baldwin <jhb@FreeBSD.org>

TCP reuses t_rxtshift to determine the backoff timer used for both the
persist state and the retransmit timer. However, the code that implements
"bad retransmit recovery" only checks t_rxtshift to see if an ACK has been
received in during the first retransmit timeout window. As a result, if
ticks has wrapped over to a negative value and a socket is in the persist
state, it can incorrectly treat an ACK from the remote peer as a
"bad retransmit recovery" and restore saved values such as snd_ssthresh and
snd_cwnd. However, if the socket has never had a retransmit timeout, then
these saved values will be zero, so snd_ssthresh and snd_cwnd will be set
to 0.

If the socket is in fast recovery (this can be caused by excessive
duplicate ACKs such as those fixed by 220794), then each ACK that arrives
triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd
to be no larger than snd_ssthresh. In effect, the socket's send window
is permamently stuck at 0 even though the remote peer is advertising a
much larger window and pending data is only sent via TCP window probes
(so one byte every few seconds).

Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that
the various snd_*_prev fields in the pcb are valid and only perform
"bad retransmit recovery" if this flag is set in the pcb. The flag is set
on the first retransmit timeout that occurs and is cleared on subsequent
retransmit timeouts or when entering the persist state.

Reviewed by: bz
MFC after: 2 weeks


# da84b2e6 18-Apr-2011 John Baldwin <jhb@FreeBSD.org>

When checking to see if a window update should be sent to the remote peer,
don't force a window update if the window would not actually grow due to
window scaling. Specifically, if the window scaling factor is larger than
2 * MSS, then after the local reader has drained 2 * MSS bytes from the
socket, a window update can end up advertising the same window. If this
happens, the supposed window update actually ends up being a duplicate ACK.
This can result in an excessive number of duplicate ACKs when using a
higher maximum socket buffer size.

Reviewed by: bz
MFC after: 1 month


# 39bc9de5 27-Dec-2010 Lawrence Stewart <lstewart@FreeBSD.org>

- Add some helper hook points to the TCP stack. The hooks allow Khelp modules to
access inbound/outbound events and associated data for established TCP
connections. The hooks only run if at least one hook function is registered
for the hook point, ensuring the impact on the stack is effectively nil when
no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
data to any registered Khelp module hook functions.

- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
modules to associate per-connection data with the TCP control block.

- Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes
introduced by this commit and r216753.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz, others along the way
MFC after: 3 months


# 2ea8da28 01-Dec-2010 Lawrence Stewart <lstewart@FreeBSD.org>

- Reinstantiate the after_idle hook call in tcp_output(), which got lost
somewhere along the way due to mismerging r211464 in our development tree.

- Capture the essence of r211464 in NewReno's after_idle() hook. We don't
use V_ss_fltsz/V_ss_fltsz_local yet which needs to be revisited.

Sponsored by: FreeBSD Foundation
Submitted by: David Hayes <dahayes at swin edu au>
MFC after: 9 weeks
X-MFC with: r215166


# f5d34df5 17-Nov-2010 George V. Neville-Neil <gnn@FreeBSD.org>

Add new, per connection, statistics for TCP, including:
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives

These statistics are available via the -T argument to
netstat(1).
MFC after: 2 weeks


# dbc42409 11-Nov-2010 Lawrence Stewart <lstewart@FreeBSD.org>

This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
algorithms to be used in the net stack. Algorithms can maintain per-connection
state if required, and connections maintain their own algorithm pointer, which
allows different connections to concurrently use different algorithms. The
TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
programmatically query or change the congestion control algorithm respectively
from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
possible. Care was also taken to develop the framework in a way that should
allow integration with other congestion aware transport protocols (e.g. SCTP)
in the future. The hope is that we will one day be able to share a single set
of congestion control algorithm modules between all congestion aware transport
protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
and use it to decouple the meaning of recovery from a congestion event and
recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
congestion control protocols don't generally need to recover from packet loss
and need a different way to note a congestion recovery episode within the
stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
and ensures the stack always uses the appropriate mechanisms for recovering
from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
massage it into module form. NewReno is always built into the kernel and will
remain the default algorithm for the forseeable future. Implementations of
additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: Cisco URP, FreeBSD Foundation
Reviewed by: rpaulo
Tested by: David Hayes (and many others over the years)
MFC after: 3 months


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# ed420311 17-Sep-2010 Andre Oppermann <andre@FreeBSD.org>

Rearrange the TSO code to make it more readable and to clearly
separate the decision logic, of whether we can do TSO, and the
calculation of the burst length into two distinct parts.

Change the way the TSO burst length calculation is done. While
TSO could do bursts of 65535 bytes that can't be represented in
ip_len together with the IP and TCP header. Account for that and
use IP_MAXPACKET instead of TCP_MAXWIN as base constant (both
have the same value of 64K). When more data is available prevent
less than MSS sized segments from being sent during the current
TSO burst.

Add two more KASSERTs to ensure the integrity of the packets.

Tested by: Ben Wilber <ben-at-desync com>
MFC after: 10 days


# 1c18314d 16-Sep-2010 Andre Oppermann <andre@FreeBSD.org>

Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework. It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI. It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI. It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned. The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.


# c3f0bdc6 18-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

If a TCP connection has been idle for one retransmit timeout or more
it must reset its congestion window back to the initial window.

RFC3390 has increased the initial window from 1 segment to up to
4 segments.

The initial window increase of RFC3390 wasn't reflected into the
restart window which remained at its original defaults of 4 segments
for local and 1 segment for all other connections. Both values are
controllable through sysctl net.inet.tcp.local_slowstart_flightsize
and net.inet.tcp.slowstart_flightsize.

The increase helps TCP's slow start algorithm to open up the congestion
window much faster.

Reviewed by: lstewart
MFC after: 1 week


# e4e92660 15-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

Fix the interaction between 'ICMP fragmentation needed' MTU updates,
path MTU discovery and the tcp_minmss limiter for very small MTU's.

When the MTU suggested by the gateway via ICMP, or if there isn't
any the next smaller step from ip_next_mtu(), is lower than the
floor enforced by net.inet.tcp.minmss (default 216) the value is
ignored and the default MSS (512) is used instead. However the
DF flag in the IP header is still set in tcp_output() preventing
fragmentation by the gateway.

Fix this by using tcp_minmss as the MSS and clear the DF flag if
the suggested MTU is too low. This turns off path MTU dissovery
for the remainder of the session and allows fragmentation to be
done by the gateway.

Only MTU's smaller than 256 are affected. The smallest official
MTU specified is for AX.25 packet radio at 256 octets.

PR: kern/146628
Tested by: Matthew Luckie <mjl-at-luckie org nz>
MFC after: 1 week


# 153e5b57 14-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

When using TSO and sending more than TCP_MAXWIN sendalot is set
and we loop back to 'again'. If the remainder is less or equal
to one full segment, the TSO flag was not cleared even though
it isn't necessary anymore. Enabling the TSO flag on a segment
that doesn't require any offloaded segmentation by the NIC may
cause confusion in the driver or hardware.

Reset the internal tso flag in tcp_output() on every iteration
of sendalot.

PR: kern/132832
Submitted by: Renaud Lienhart <renaud-at-vmware com>
MFC after: 1 week


# 6774e0f9 20-May-2010 Kenneth D. Merry <ken@FreeBSD.org>

MFC r206844:

Don't clear other flags (e.g. CSUM_TCP) when setting CSUM_TSO. This was
causing TSO to break for the Xen netfront driver.

Reviewed by: gibbs, rwatson


# 480d7c6c 06-May-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r207369:
MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH


# 82cea7e6 29-Apr-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 3579cf4c 19-Apr-2010 Kenneth D. Merry <ken@FreeBSD.org>

Don't clear other flags (e.g. CSUM_TCP) when setting CSUM_TSO. This was
causing TSO to break for the Xen netfront driver.

Reviewed by: gibbs, rwatson
MFC after: 7 days


# 24b458cf 17-Nov-2009 John Baldwin <jhb@FreeBSD.org>

MFC 198990:
Several years ago a feature was added to TCP that casued soreceive() to
send an ACK right away if data was drained from a TCP socket that had
previously advertised a zero-sized window. The current code requires the
receive window to be exactly zero for this to kick in. If window scaling is
enabled and the window is smaller than the scale, then the effective window
that is advertised is zero. However, in that case the zero-sized window
handling is not enabled because the window is not exactly zero. The fix
changes the code to check the raw window value against zero.


# c6d94805 06-Nov-2009 John Baldwin <jhb@FreeBSD.org>

Several years ago a feature was added to TCP that casued soreceive() to
send an ACK right away if data was drained from a TCP socket that had
previously advertised a zero-sized window. The current code requires the
receive window to be exactly zero for this to kick in. If window scaling is
enabled and the window is smaller than the scale, then the effective window
that is advertised is zero. However, in that case the zero-sized window
handling is not enabled because the window is not exactly zero. The fix
changes the code to check the raw window value against zero.

Reviewed by: bz
MFC after: 1 week


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 6b0c5521 16-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Trim extra sets of ()'s.

Requested by: bde


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 78b50714 11-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after: 3 days


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 5cd54324 27-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Replace most INP_CHECK_SOCKAF() uses checking if it is an
IPv6 socket by comparing a constant inp vflag.
This is expected to help to reduce extra locking.

Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks


# 44e33a07 19-Nov-2008 Marko Zec <zec@FreeBSD.org>

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 3418daf2 13-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Implement IPv6 support for TCP MD5 Signature Option (RFC 2385)
the same way it has been implemented for IPv4.

Reviewed by: bms (skimmed)
Tested by: Nick Hilliard (nick netability.ie) (with more changes)
MFC after: 2 months


# c4982fae 07-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Add a second KASSERT checking for len >= 0 in the tcp output path.

This is different to the first one (as len gets updated between those
two) and would have caught various edge cases (read bugs) at a well
defined place I had been debugging the last months instead of
triggering (random) panics further down the call graph.

MFC after: 2 months


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# f2512ba1 31-Jul-2008 Rui Paulo <rpaulo@FreeBSD.org>

MFp4 (//depot/projects/tcpecn/):

TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
TCP ECN is defined in RFC 3168.

Partly reviewed by: dwmalone, silby
Obtained from: NetBSD


# b2722702 15-Jul-2008 Rui Paulo <rpaulo@FreeBSD.org>

Fix commment in typo.

M tcp_output.c


# 8501a69c 17-Apr-2008 Robert Watson <rwatson@FreeBSD.org>

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# 3a4018c4 07-Apr-2008 Andre Oppermann <andre@FreeBSD.org>

Remove TCP options ordering assumptions in tcp_addoptions(). Ordering
was changed in rev. 1.161 of tcp_var.h. All option now test for sufficient
space in TCP header before getting added.

Reported by: Mark Atkinson <atkin901-at-yahoo.com>
Tested by: Mark Atkinson <atkin901-at-yahoo.com>
MFC after: 1 week


# 5b2e33ea 07-Apr-2008 Andre Oppermann <andre@FreeBSD.org>

Remove now unnecessary comment.


# c343c524 07-Apr-2008 Andre Oppermann <andre@FreeBSD.org>

Use #defines for TCP options padding after EOL to be consistent.

Reviewed by: bz


# 413deb12 09-Mar-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Padding after EOL option must be zeros according to RFC793 but
the NOPs used are 0x01.
While we could simply pad with EOLs (which are 0x00), rather use an
explicit 0x00 constant there to not confuse poeple with 'EOL padding'.
Put in a comment saying just that.

Problem discussed on: src-committers with andre, silby, dwhite as
follow up to the rev. 1.161 commit of tcp_var.h.
MFC after: 11 days


# ee763d0d 30-Nov-2007 Bjoern A. Zeeb <bz@FreeBSD.org>

Centralize and correct computation of TCP-MD5 signature offset within
the packet (tcp header options field).

Reviewed by: tools/regression/netinet/tcpconnect
MFC after: 3 days
Tested by: Nick Hilliard (see net@)


# 4a411b9f 28-Nov-2007 Bjoern A. Zeeb <bz@FreeBSD.org>

Let opt be an array. Though &opt[0] == opt == &opt, &opt is highly
confusing and hard to understand so change it to just opt and
remove the extra cast no longer/not needed.

Discussed with: rwatson
MFC after: 3 days


# 9ad0173d 21-Nov-2007 Bjoern A. Zeeb <bz@FreeBSD.org>

Make TSO work with IPSEC compiled into the kernel.

The lookup hurts a bit for connections but had been there anyway
if IPSEC was compiled in. So moving the lookup up a bit gives us
TSO support at not extra cost.

PR: kern/115586
Tested by: gallatin
Discussed with: kmacy
MFC after: 2 months


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 4b421e2d 07-Oct-2007 Mike Silbersack <silby@FreeBSD.org>

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# b2630c29 02-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


# 2cb64cb2 01-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


# 104ebb2a 09-Jun-2007 Andre Oppermann <andre@FreeBSD.org>

Make the handling of the tcp window explicit for the SYN_SENT case
in tcp_outout(). This is currently not strictly necessary but paves
the way to simplify the entire SYN options handling quite a bit.
Clarify comment. No change in effective behavour with this commit.

RFC1323 requires the window field in a SYN (i.e., a <SYN> or
<SYN,ACK>) segment itself never be scaled.


# b7de7d87 09-Jun-2007 Andre Oppermann <andre@FreeBSD.org>

Don't send pure window updates when the peer has closed the connection
and won't ever send more data.


# 0ba5d2ee 18-May-2007 John Baldwin <jhb@FreeBSD.org>

Fix statistical accounting for bytes and packets during sack retransmits.

MFC after: 1 week
Submitted by: mohans


# 4b8e42ba 10-May-2007 Andre Oppermann <andre@FreeBSD.org>

Fix an incorrect replace of a timer reference made during the TCP timer
rewrite in rev. 1.132. This unmasked yet another bug that causes certain
connections to get indefinately stuck in LAST_ACK state.


# 3529149e 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool. Change all users accordingly.


# 0d957bba 20-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

o Remove unused and redundant TCP option definitions
o Replace usage of MAX_TCPOPTLEN with the correctly constructed and
derived MAX_TCPOPTLEN


# b8152ba7 11-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by: rwatson (earlier version)


# 5dd9dfef 04-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Retire unused TCP_SACK_DEBUG.


# ad3f9ab3 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.


# eec9d82d 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Subtract optlen in the maximum length check for TSO and finally avoid
slightly oversized TSO mbuf chains.

Submitted by: kmacy


# 8b8ed7a7 19-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Match up SYSCTL_INT declarations in style.


# 4e023759 19-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Maintain a pointer and offset pair into the socket buffer mbuf chain to
avoid traversal of the entire socket buffer for larger offsets on stream
sockets.

Adjust tcp_output() make use of it.

Tested by: gallatin


# 02a1a643 15-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Consolidate insertion of TCP options into a segment from within tcp_output()
and syncache_respond() into its own generic function tcp_addoptions().

tcp_addoptions() is alignment agnostic and does optimal packing in all cases.

In struct tcpopt rename to_requested_s_scale to just to_wscale.

Add a comment with quote from RFC1323: "The Window field in a SYN (i.e.,
a <SYN> or <SYN,ACK>) segment itself is never scaled."

Reviewed by: silby, mohans, julian
Sponsored by: TCP/IP Optimization Fundraise 2005


# 6aa5b623 01-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Prevent TSO mbuf chain from overflowing a few bytes by subtracting the
TCP options size before the TSO total length calculation.

Bug found by: kmacy


# 8bec3467 27-Feb-2007 Gleb Smirnoff <glebius@FreeBSD.org>

Add EHOSTDOWN and ENETUNREACH to the list of soft errors, that shouldn't
be returned up to the caller.

PR: 100172
Submitted by: "Andrew - Supernews" <andrew supernews.net>
Reviewed by: rwatson, bms


# 72757d9a 27-Feb-2007 Gleb Smirnoff <glebius@FreeBSD.org>

Toss the code, that handles errors from ip_output(), to make it more
readable:
- Merge two embedded if() into one.
- Introduce switch() block to handle different kinds of errors.

Reviewed by: rwatson, bms


# 6741ecf5 01-Feb-2007 Andre Oppermann <andre@FreeBSD.org>

Auto sizing TCP socket buffers.

Normally the socket buffers are static (either derived from global
defaults or set with setsockopt) and do not adapt to real network
conditions. Two things happen: a) your socket buffers are too small
and you can't reach the full potential of the network between both
hosts; b) your socket buffers are too big and you waste a lot of
kernel memory for data just sitting around.

With automatic TCP send and receive socket buffers we can start with a
small buffer and quickly grow it in parallel with the TCP congestion
window to match real network conditions.

FreeBSD has a default 32K send socket buffer. This supports a maximal
transfer rate of only slightly more than 2Mbit/s on a 100ms RTT
trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send
buffer auto scaling and the default values below it supports 20Mbit/s
at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or
1000%. For the receive side it looks slightly better with a default of
64K buffer size.

New sysctls are:
net.inet.tcp.sendbuf_auto=1 (enabled)
net.inet.tcp.sendbuf_inc=8192 (8K, step size)
net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
net.inet.tcp.recvbuf_auto=1 (enabled)
net.inet.tcp.recvbuf_inc=16384 (16K, step size)
net.inet.tcp.recvbuf_max=262144 (256K, growth limit)

Tested by: many (on HEAD and RELENG_6)
Approved by: re
MFC after: 1 month


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 2c30ec0a 28-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

When tcp_output() receives an error upon sending a packet it reverts parts
of its internal state to ignore the failed send and try again a bit later.
If the error is EPERM the packet got blocked by the local firewall and the
revert may cause the session to get stuck and retry indefinitely. This way
we treat it like a packet loss and let the retransmit timer and timeouts
do their work over time.

The correct behavior is to drop a connection that gets an EPERM error.
However this _may_ introduce some POLA problems and a two commit approach
was chosen.

Discussed with: glebius
PR: kern/25986
PR: kern/102653


# 6a2257d9 28-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

When doing TSO correctly do the check to prevent a maximum sized IP packet
from overflowing.


# 31ecb34a 15-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

When doing TSO subtract hdrlen from TCP_MAXWIN to prevent ip->ip_len
from wrapping when we generate a maximally sized packet for later
segmentation.

Noticed by: gallatin
Sponsored by: TCP/IP Optimization Fundraise 2005


# bf6d304a 13-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Rewrite of TCP syncookies to remove locking requirements and to enhance
functionality:

- Remove a rwlock aquisition/release per generated syncookie. Locking
is now integrated with the bucket row locking of syncache itself and
syncookies no longer add any additional lock overhead.
- Syncookie secrets are different for and stored per syncache buck row.
Secrets expire after 16 seconds and are reseeded on-demand.
- The computational overhead for syncookie generation and verification
is one MD5 hash computation as before.
- Syncache can be turned off and run with syncookies only by setting the
sysctl net.inet.tcp.syncookies_only=1.

This implementation extends the orginal idea and first implementation
of FreeBSD by using not only the initial sequence number field to store
information but also the timestamp field if present. This way we can
keep track of the entire state we need to know to recreate the session in
its original form. Almost all TCP speakers implement RFC1323 timestamps
these days. For those that do not we still have to live with the known
shortcomings of the ISN only SYN cookies. The use of the timestamp field
causes the timestamps to be randomized if syncookies are enabled.

The idea of SYN cookies is to encode and include all necessary information
about the connection setup state within the SYN-ACK we send back and thus
to get along without keeping any local state until the ACK to the SYN-ACK
arrives (if ever). Everything we need to know should be available from
the information we encoded in the SYN-ACK.

A detailed description of the inner working of the syncookies mechanism
is included in the comments in tcp_syncache.c.

Reviewed by: silby (slightly earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005


# b3c0f300 06-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Second step of TSO (TCP segmentation offload) support in our network stack.

TSO is only used if we are in a pure bulk sending state. The presence of
TCP-MD5, SACK retransmits, SACK advertizements, IPSEC and IP options prevent
using TSO. With TSO the TCP header is the same (except for the sequence number)
for all generated packets. This makes it impossible to transmit any options
which vary per generated segment or packet.

The length of TSO bursts is limited to TCP_MAXWIN.

The sysctl net.inet.tcp.tso globally controls the use of TSO and is enabled.

TSO enabled sends originating from tcp_output() have the CSUM_TCP and CSUM_TSO
flags set, m_pkthdr.csum_data filled with the header pseudo-checksum and
m_pkthdr.tso_segsz set to the segment size (net payload size, not counting
IP+TCP headers or TCP options).

IPv6 currently lacks a pseudo-header checksum function and thus doesn't support
TSO yet.

Tested by: Jack Vogel <jfvogel-at-gmail.com>
Sponsored by: TCP/IP Optimization Fundraise 2005


# 4b8e98d6 23-Feb-2006 Qing Li <qingli@FreeBSD.org>

This patch fixes the problem where the current TCP code can not handle
simultaneous open. Both the bug and the patch were verified using the
ANVL test suite.

PR: kern/74935
Submitted by: qingli (before I became committer)
Reviewed by: andre
MFC after: 5 days


# ef39adf0 18-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
ip_dooptions(struct mbuf *m, int pass)
save_rte(m, option, dst)
ip_srcroute(m0)
ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
ip_insertoptions(m, opt, phlen)
ip_optcopy(ip, jp)
ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


# 34333b16 02-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 2cdbfa66 20-May-2005 Paul Saab <ps@FreeBSD.org>

Replace t_force with a t_flag (TF_FORCEDATA).

Submitted by: Raja Mukerji.
Reviewed by: Mohan, Silby, Andre Opperman.


# 0077b016 11-May-2005 Paul Saab <ps@FreeBSD.org>

When looking for the next hole to retransmit from the scoreboard,
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.

This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.

The debug code that sanity checks the hints against the computed
value will be removed eventually.

Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.


# be3f3b5e 21-Apr-2005 Paul Saab <ps@FreeBSD.org>

Fix for interaction problems between TCP SACK and TCP Signature.
If TCP Signatures are enabled, the maximum allowed sack blocks aren't
going to fit. The fix is to compute how many sack blocks fit and tack
these on last. Also on SYNs, defer padding until after the SACK
PERMITTED option has been added.

Found by: Mohan Srinivasan.
Submitted by: Mohan Srinivasan, Noritoshi Demizu.
Reviewed by: Raja Mukerji.


# 1600372b 20-Apr-2005 Andre Oppermann <andre@FreeBSD.org>

Ignore ICMP Source Quench messages for TCP sessions. Source Quench is
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.

Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.

Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after: 3 days


# 8d03f2b5 12-Jan-2005 Paul Saab <ps@FreeBSD.org>

Fix a TCP SACK related crash resulting from incorrect computation
of len in tcp_output(), in the case where the FIN has already been
transmitted. The mis-computation of len is because of a gcc
optimization issue, which this change works around.

Submitted by: Mohan Srinivasan


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 7d5ed1ce 29-Nov-2004 Paul Saab <ps@FreeBSD.org>

Fixes a bug in SACK causing us to send data beyond the receive window.

Found by: Pawel Worach and Daniel Hartmeier
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


# c94c54e4 02-Nov-2004 Andre Oppermann <andre@FreeBSD.org>

Remove RFC1644 T/TCP support from the TCP side of the network stack.

A complete rationale and discussion is given in this message
and the resulting discussion:

http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on: -arch


# ab5c14d8 29-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Correct a bug in TCP SACK that could result in wedging of the TCP stack
under high load: only set function state to loop and continuing sending
if there is no data left to send.

RELENG_5_3 candidate.

Feet provided: Peter Losher <Peter underscore Losher at isc dot org>
Diagnosed by: Aniel Hartmeier <daniel at benzedrine dot cx>
Submitted by: mohan <mohans at yahoo-inc dot com>


# cf2942b6 09-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Acquire the send socket buffer lock around tcp_output() activities
reaching into the socket buffer. This prevents a number of potential
races, including dereferencing of sb_mb while unlocked leading to
a NULL pointer deref (how I found it). Potentially this might also
explain other "odd" TCP behavior on SMP boxes (although haven't
seen it reported).

RELENG_5 candidate.


# a55db2b6 05-Oct-2004 Paul Saab <ps@FreeBSD.org>

- Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.

Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>


# b5d47ff5 04-Sep-2004 John-Mark Gurney <jmg@FreeBSD.org>

fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...


# a4f757cd 16-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>


# 5d3b1b75 27-Jul-2004 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

Fix a bug in the sack code that was causing data to be retransmitted
with the FIN bit set for all segments, if a FIN has already been sent before.
The fix will allow the FIN bit to be set for only the last segment, in case
it has to be retransmitted.

Fix another bug that would have caused snd_nxt to be pulled by len if
there was an error from ip_output. snd_nxt should not be touched
during sack retransmissions.


# e9f2f80e 26-Jul-2004 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

Fix for a SACK bug where the very last segment retransmitted
from the SACK scoreboard could result in the next (untransmitted)
segment to be skipped.


# 04f0d9a0 19-Jul-2004 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

Let IN_FASTREOCOVERY macro decide if we are in recovery mode.

Nuke sackhole_limit for now. We need to add it back to limit the total
number of sack blocks in the system.


# f787edd8 19-Jul-2004 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

Fix a potential panic in the SACK code that was causing
1) data to be sent to the right of snd_recover.
2) send more data then whats in the send buffer.

The fix is to postpone sack retransmit to a subsequent recovery episode
if the current retransmit pointer is beyond snd_recover.

Thanks to Mohan Srinivasan for helping fix the bug.

Submitted by:Daniel Lang


# 6d90faf3 23-Jun-2004 Paul Saab <ps@FreeBSD.org>

Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)


# f3e0b7ef 18-Jun-2004 Bruce M Simpson <bms@FreeBSD.org>

Appease GCC.


# 5214cb3f 17-Jun-2004 Bruce M Simpson <bms@FreeBSD.org>

If SO_DEBUG is enabled for a TCP socket, and a received segment is
encapsulated within an IPv6 datagram, do not abuse the 'ipov' pointer
when registering trace records. 'ipov' is specific to IPv4, and
will therefore be uninitialized.

[This fandango is only necessary in the first place because of our
host-byte-order IP field pessimization.]

PR: kern/60856
Submitted by: Galois Zheng


# da181cc1 17-Jun-2004 Bruce M Simpson <bms@FreeBSD.org>

Don't set FIN on a retransmitted segment after a FIN has been sent,
unless the segment really contains the last of the data for the stream.

PR: kern/34619
Obtained from: OpenBSD (tcp_output.c rev 1.47)
Noticed by: Joseph Ishac
Reviewed by: George Neville-Neil


# c18b97c6 03-May-2004 Robert Watson <rwatson@FreeBSD.org>

Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed. In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 265ed012 13-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Brucification.

Submitted by: bde


# bca0e5bf 12-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

style(9) pass; whitespace and comments.

Submitted by: njl


# 45d370ee 11-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Fix a typo; left out preprocessor conditional for sigoff variable, which
is only used by TCP_SIGNATURE code.

Noticed by: Roop Nanuwa


# 1cfd4b53 10-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


# f073c60f 03-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

pass pcb rather than so. it is expected that per socket policy
works again.


# 201d185b 22-Jan-2004 Andre Oppermann <andre@FreeBSD.org>

Split the overloaded variable 'win' into two for their specific purposes:
recwin and sendwin. This removes a big source of confusion and makes
following the code much easier.

Reviewed by: sam (mentor)
Obtained from: DragonFlyBSD rev 1.6 (hsu)


# 97d8d152 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# fa286d7d 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

replace mtx_assert by INP_LOCK_ASSERT

Supported by: FreeBSD Foundation


# 27a940c9 07-Nov-2003 Sam Leffler <sam@FreeBSD.org>

unbreak compilation of FAST_IPSEC

Supported by: FreeBSD Foundation


# 0f9ade71 04-Nov-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- cleanup SP refcnt issue.
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
kernel forwards the packets.

Tested by: nork
Obtained from: KAME


# 91f467d5 13-Aug-2003 Hartmut Brandt <harti@FreeBSD.org>

The tcp_trace call needs the length of the header. Unfortunately the
code has rotten a bit so that the header length is not correct at
the point when tcp_trace is called. Temporarily compute the correct
value before the call and restore the old value after. This makes
ports/benchmarks/dbs to almost work.

This is a NOP unless you compile with TCPDEBUG.


# 79909384 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the
routine does not require a tcpcb to operate. Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.

Sponsored by: DARPA, NAI Labs


# 3bfd6421 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Clean up delayed acks and T/TCP interactions:
- delay acks for T/TCP regardless of delack setting
- fix bug where a single pass through tcp_input might not delay acks
- use callout_active() instead of callout_pending()

Sponsored by: DARPA, NAI Labs


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 314e5a3d 18-Jan-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Optimize away call to bzero() in the common case by directly checking
if a connection has any cached TAO information.


# abac41a6 16-Oct-2002 Matthew Dillon <dillon@FreeBSD.org>

Fix oops in my last commit, I was calculating a new length but then not
using it. (The code is already correct in -stable).

Found by: silby


# b9234faf 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 5d846453 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# 28257b5c 10-Oct-2002 Matthew Dillon <dillon@FreeBSD.org>

Update various comments mainly related to retransmit/FIN that I
documented while working on a previous bug.

Fix a PERSIST bug. Properly account for a FIN sent during a PERSIST.

MFC after: 7 days


# 4a03a8a8 16-Sep-2002 Jennifer Yang <jennifer@FreeBSD.org>

Tempary fix for inet6. The final fix is to change in6_pcbnotify to take pcbinfo instead
of pcbhead. It is on the way.


# 1fcc99b5 17-Aug-2002 Matthew Dillon <dillon@FreeBSD.org>

Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit

MFC after: 1 week


# 3d6ade3a 11-Aug-2002 Jennifer Yang <jennifer@FreeBSD.org>

Assert that the inpcb lock is held when calling tcp_output().

Approved by: hsu


# c488362e 31-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket. Assign labels
to newly accepted connections when the syncache/cookie code has done
its business. Also set peer labels as convenient. Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation. Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# f10e85d7 23-Jun-2002 Luigi Rizzo <luigi@FreeBSD.org>

Slightly restructure the #ifdef INET6 sections to make the code
more readable.

Remove the six "register" attributes from variables tcp_output(), the
compiler surely knows well how to allocate them.


# eb5afeba 13-Jun-2002 Mike Silbersack <silby@FreeBSD.org>

Re-commit w/fix:

Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.

PR: 39141
MFC after: 2 weeks

This time, make sure that ipv4 specific code (aka all of the above)
is only run in the ipv4 case.


# 70d2b170 13-Jun-2002 Mike Silbersack <silby@FreeBSD.org>

Back out ip_tos/ip_ttl/DF "fix", it just panic'd my box. :)

Pointy-hat to: silby


# 21c3b2fc 13-Jun-2002 Mike Silbersack <silby@FreeBSD.org>

Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.

PR: 39141
MFC after: 2 weeks


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 3260eb18 14-Dec-2001 Mike Silbersack <silby@FreeBSD.org>

Reduce the local network slowstart flightsize from infinity to 4 packets.

Now that we've increased the size of our send / receive buffers, bursting
an entire window onto the network may cause congestion. As a result,
we will slow start beginning with a flightsize of 4 packets.

Problem reported by: Thomas Zenker <thz@Lennartz-electronic.de>

MFC after: 3 days


# 7c183182 12-Dec-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Fix up tabs from cut&n&paste.


# 6e551fb6 10-Dec-2001 David E. O'Brien <obrien@FreeBSD.org>

Update to C99, s/__FUNCTION__/__func__/,
also don't use ANSI string concatenation.


# 262c1c1a 02-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix a bug with transmitter restart after receiving a 0 window. The
receiver was not sending an immediate ack with delayed acks turned on
when the input buffer is drained, preventing the transmitter from
restarting immediately.

Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and
is a good idea anyway).

Some cleanup. Identify additonal issues in comments.

MFC after: 1 day


# d912c694 30-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

The transmit burst limit for newreno completely breaks TCP's performance
if the receive side is using delayed acks. Temporarily remove it.

MFC after: 0 days


# be2ac88c 21-Nov-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs


# c24d5dae 05-Oct-2001 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

Add a flag TF_LASTIDLE, that forces a previously idle connection
to send all its data, especially when the data is less than one MSS.
This fixes an issue where the stack was delaying the sending
of data, eventhough there was enough window to send all the data and
the sending of data was emptying the socket buffer.

Problem found by Yoshihiro Tsuchiya (tsuchiya@flab.fujitsu.co.jp)

Submitted by: Jayanth Vijayaraghavan


# 08517d53 22-Jun-2001 Mike Silbersack <silby@FreeBSD.org>

Eliminate the allocation of a tcp template structure for each
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.

Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 46aa3347 27-Oct-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Convert all users of fldoff() to offsetof(). fldoff() is bad
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.

Define __offsetof() in <machine/ansi.h>

Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>

Remove myriad of local offsetof() definitions.

Remove includes of <stddef.h> in kernel code.

NB: Kernelcode should *never* include from /usr/include !

Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.

Deprecate <struct.h> with a warning. The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.

Paritials reviews by: various.
Significant brucifications by: bde


# 6612c70e 11-Sep-2000 Archie Cobbs <archie@FreeBSD.org>

Don't do snd_nxt rollback optimization (rev. 1.46) for SYN packets.
It causes a panic when/if snd_una is incremented elsewhere (this
is a conservative change, because originally no rollback occurred
for any packets at all).

Submitted by: Vivek Sadananda Pai <vivek@imimic.com>


# 7734ea06 03-Aug-2000 Archie Cobbs <archie@FreeBSD.org>

Improve performance in the case where ip_output() returns an error.
When this happens, we know for sure that the packet data was not
received by the peer. Therefore, back out any advancing of the
transmit sequence number so that we send the same data the next
time we transmit a packet, avoiding a guaranteed missed packet and
its resulting TCP transmit slowdown.

In most systems ip_output() probably never returns an error, and
so this problem is never seen. However, it is more likely to occur
with device drivers having short output queues (causing ENOBUFS to
be returned when they are full), not to mention low memory situations.

Moreover, because of this problem writers of slow devices were
required to make an unfortunate choice between (a) having a relatively
short output queue (with low latency but low TCP bandwidth because
of this problem) or (b) a long output queue (with high latency and
high TCP bandwidth). In my particular application (ISDN) it took
an output queue equal to ~5 seconds of transmission to avoid ENOBUFS.
A more reasonable output queue of 0.5 seconds resulted in only about
50% TCP throughput. With this patch full throughput was restored in
the latter case.

Reviewed by: freebsd-net


# 7d200109 12-Jul-2000 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

re-enable the tcp newreno code.


# 686cdd19 04-Jul-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 59f577ad 02-Jun-2000 Jonathan Lemon <jlemon@FreeBSD.org>

When attempting to transmit a packet, if the system fails to allocate
a mbuf, it may return without setting any timers. If no more data is
scheduled to be transmitted (this was a FIN) the system will sit in
LAST_ACK state forever.

Thus, when mbuf allocation fails, set the retransmit timer if neither
the retransmit or persist timer is already pending.

Problem discovered by: Mike Silbersack (silby@silby.com)
Pushed for a fix by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: jayanth


# 4aae1da6 11-May-2000 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

Temporarily turn off the newreno flag until we can track down the known
data corruption problem.


# 46f58482 05-May-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Implement TCP NewReno, as documented in RFC 2582. This allows
better recovery for multiple packet losses in a single window.
The algorithm can be toggled via the sysctl net.inet.tcp.newreno,
which defaults to "on".

Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>


# db4f9cc7 27-Mar-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


# a683a7dd 08-Feb-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Avoid kernel panic when tcp rfc1323 and rfc1644 options are enabled
at the same time.

When rfc1323 and rfc1644 option are enabled by sysctl,
and tcp over IPv6 is tried, kernel panic happens by the
following check in tcp_output(), because now hdrlen is bigger
in such case than before.

/*#ifdef DIAGNOSTIC*/
if (max_linkhdr + hdrlen > MHLEN)
panic("tcphdr too big");
/*#endif*/

So change the above check to compare with MCLBYTES in #ifdef INET6 case.
Also, allocate a mbuf cluster for the header mbuf, in that case.

Bug reported at KAME environment.
Approved by: jkh

Reviewed by: sumikawa
Obtained from: KAME project


# 3a2a9f79 15-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Fixed the problem that IPsec connection hangs when bigger data is sent.
-opt_ipsec.h was missing on some tcp files (sorry for basic mistake)
-made buildable as above fix
-also added some missing IPv4 mapped IPv6 addr consideration into
ipsec4_getpolicybysock


# fb59c426 09-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 9b8b58e0 30-Aug-1999 Jonathan Lemon <jlemon@FreeBSD.org>

Restructure TCP timeout handling:

- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.

Submitted by: jlemon, wollmann


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 9a039a5f 26-May-1999 David Greenman <dg@FreeBSD.org>

Added net.inet.tcp.path_mtu_discovery variable which when set to 0
(default 1) disables PMTUD globally. Although PMTUD can be disabled in
the standard case by locking the MTU on a static route (including the
default route), this method doesn't work in the face of dynamic routing
protocols like gated.


# 29089b51 07-Apr-1999 Julian Elischer <julian@FreeBSD.org>

Two cosmetic changes, one a typo and the other, a clarification.


# b0acefa8 20-Jan-1999 Bill Fenner <fenner@FreeBSD.org>

Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This
flag means that there is more data to be put into the socket buffer.
Use it in TCP to reduce the interaction between mbuf sizes and the
Nagle algorithm.

Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's
fix for this problem.


# 9105bb46 13-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed overflow and sign extension bugs in
`len = min(so->so_snd.sb_cc, win) - off;'. min() has type u_int
and `off' has type int, so when min() is 0 and `off' is 1, the RHS
overflows to 0U - 1 = UINT_MAX. `len' has type long, so when
sizeof(long) == sizeof(int), the LHS normally overflows to to the
correct value of -1, but when sizeof(long) > sizeof(int), the LHS
is UINT_MAX.

Fixed some u_long's that should have been fixed-sized types.


# a04884fc 24-May-1998 Bill Fenner <fenner@FreeBSD.org>

Take IP options into account when calculating the allowable length
of the TCP payload. See RFC1122 section 4.2.2.6 . This allows
Path MTU discovery to be used along with IP options.

PR: problem discovered by Kevin Lahey <kml@nas.nasa.gov>


# 8e5db87c 06-Apr-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the last traces of TUBA.

Inspired by: PR kern/3317


# d68fa50c 20-Feb-1998 Bruce Evans <bde@FreeBSD.org>

Don't depend on "implicit int".


# 610a2e9c 07-Oct-1997 Bill Fenner <fenner@FreeBSD.org>

Don't allow the window to be increased beyond what is possible to
represent in the TCP header. The old code did effectively:
win = min(win, MAX_ALLOWED);
win = max(win, what_i_think_i_advertised_last_time);
so if what_i_think_i_advertised_last_time is bigger than can be
represented in the header (e.g. large buffers and no window scaling)
then we stuff a too-big number into a short. This fix reverses the
order of the comparisons.

PR: kern/4712


# 0cc12cc5 16-Sep-1997 Joerg Wunsch <joerg@FreeBSD.org>

Make TCPDEBUG a new-style option.


# 1fd0b058 02-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# ca98b82c 02-Apr-1997 David Greenman <dg@FreeBSD.org>

Reorganize elements of the inpcb struct to take better advantage of
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).

Discussed-with: wollman


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 0453d3cb 08-Jun-1996 Bruce Evans <bde@FreeBSD.org>

Changed some memcpy()'s back to bcopy()'s.

gcc only inlines memcpy()'s whose count is constant and didn't inline
these. I want memcpy() in the kernel go away so that it's obvious that
it doesn't need to be optimized. Now it is only used for one struct
copy in si.c.


# 2d8266af 14-Apr-1996 David Greenman <dg@FreeBSD.org>

Two fixes from Rich Stevens:

1) Set the persist timer to help time-out connections in the CLOSING state.
2) Honor the keep-alive timer in the CLOSING state.

This fixes problems with connections getting "stuck" due to incompletion
of the final connection shutdown which can be a BIG problem on busy WWW
servers.


# 2ee45d7d 11-Mar-1996 David Greenman <dg@FreeBSD.org>

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# 81165e48 17-Jan-1996 Andras Olah <olah@FreeBSD.org>

Be more conservative when T/TCP extensions are disabled. In particular,
do not send data and/or FIN on SYN segments in this case.


# b7a44e34 05-Dec-1995 Garrett Wollman <wollman@FreeBSD.org>

Path MTU Discovery is now standard.


# a45d2726 03-Nov-1995 Andras Olah <olah@FreeBSD.org>

Fix a logical error in T/TCP: when we actively open a connection, we
have to decide whether to send a CC or CCnew option in our SYN segment
depending on the contents of our TAO cache. This decision has to be
made once when the connection starts. The earlier code delayed this
decision until the segment was assembled in tcp_output() and
retransmitted SYN segments could have different CC options.

Reviewed by: Richard Stevens, davidg, wollman


# 3d1f141b 16-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

The ability to administratively change the MTU of an interface presents
a few new wrinkles for MTU discovery which tcp_output() had better
be prepared to handle. ip_output() is also modified to do something
helpful in this case, since it has already calculated the information
we need.


# d7f570e6 22-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge 4.4-Lite-2: update version number (we already have the same fixes).

Obtained from: 4.4BSD-Lite-2


# efe4b0eb 21-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Second try: get 4.4-Lite-2 into the source tree. The conflicts don't
matter because none of our working source files are on the CSRG branch
any more.

Obtained from: 4.4BSD-Lite-2


# f138387a 20-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Add support in TCP for Path MTU discovery. This is highly experimental
and gated on `options MTUDISC' in the source. It is also practically
untested becausse (sniff!) I don't have easy access to a network with
an MTU of less than an Ethernet. If you have a small MTU network,
please try it and tell me if it works!


# 51823c3a 13-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

If tcp_output() is unable to allocate space for a copy of the data waiting
to be sent, just clean up and return ENOBUFS rather than silently
proceeding without sending any of the data. This makes it consistent
with the `#ifdef notyet' case immediately above.

Reviewed by: Andras Olah <olah@freebsd.org>
Obtained from: Lite-2


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 94a5d9b6 09-May-1995 David Greenman <dg@FreeBSD.org>

Replaced some bcopy()'s with memcpy()'s so that gcc while inline/optimize.


# 15bd2b43 08-Apr-1995 David Greenman <dg@FreeBSD.org>

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# 41f82abe 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Transaction TCP support now standard. Hack away!


# a0292f23 09-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and
Bob Braden <braden@isi.edu>.

NB: This has not had David's TCP ACK hack re-integrated. It is not clear
what the correct solution to this problem is, if any. If a better solution
doesn't pop up in response to this message, I'll put David's code back in
(or he's welcome to do so himself).


# 8eea1207 25-Jan-1995 David Greenman <dg@FreeBSD.org>

Kill previous commit as it isn't necessary.


# b99f012e 24-Jan-1995 David Greenman <dg@FreeBSD.org>

Extended the previous change to cover the non-options case, too.


# 297a37f3 23-Jan-1995 David Greenman <dg@FreeBSD.org>

Applied fix from Andreas Schulz with a different comment by me. Fixes a
bug where TCP connections are closed prematurely.

Submitted by: Andreas Schulz


# 610ee2f9 15-Sep-1994 David Greenman <dg@FreeBSD.org>

Made TCPDEBUG truely optional. Based on changes I made in FreeBSD 1.1.5.
Fixed somebody's idea of a joke - about the first half of the lines in
in_proto.c were spaced over by one space.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources