History log of /freebsd-current/sys/netinet/tcp_input.c
Revision Date Author Comments
# df9de82f 25-May-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix sending RST after second inp lookup

When we first find an inp, we set also the tp. If then a second
lookup is necessary, the inp is recomputed. If this fails, the
tp is not cleared, which resulted in failing KASSERT.
Therefore, clear the tp when staring the inp lookup procedure.
Reported by: Jenkins
Fixes: 02d15215cef2 ("tcp: improve blackhole support")
MFC after: 1 week
Sponsored by: Netflix, Inc.


# 02d15215 23-May-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve blackhole support

There are two improvements to the TCP blackhole support:
(1) If net.inet.tcp.blackhole is set to 2, also sent no RST whenever
a segment is received on an existing closed socket or if there is
a port mismatch when using UDP encapsulation.
(2) If net.inet.tcp.blackhole is set to 3, no RST segment is sent in
response to incoming segments on closed sockets or in response to
unexpected segments on listening sockets.
Thanks to gallatin@ for suggesting such an improvement.

Reviewed by: gallatin
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45304


# fce03f85 05-May-2024 Randall Stewart <rrs@FreeBSD.org>

TCP can be subject to Sack Attacks lets fix this issue.

There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:

ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends

ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.

To combat this we introduce three things.

when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.
The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).

After this set of changes is in rack can drop its SAD detection completely

Reviewed by:tuexen@, rscheff@
Differential Revision: <https://reviews.freebsd.org/D44903>


# c9cd686b 18-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: drop data received after a FIN has been processed

RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING,
LAST-ACK, and TIME-WAIT states:
This should not occur since a FIN has been received from the remote
side. Ignore the segment text.
Therefore, implement this handling.

Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44746


# e8c149ab 07-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add some debug output

Also log, when dropping text or FIN after having received a FIN.
This is the intended behavior described in RFC 9293.
A follow-up patch will enforce this behavior for the base stack
and the RACK stack.
Reviewed by: rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44669


# 3e1c8a35 06-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve consistency

No functional change intended.

Reported by: Coverity Scan
CID: 1523781
Reviewed by: rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44645


# dd7b86e2 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove IS_FASTOPEN() macro

The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.

Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362


# 40fdc6d2 24-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: provide correct snd_fack on post_recovery

Ensure that snd_fack holds a valid value when doing
the post_recovery CC processing, for preparation of
the cc_cubic update, so that local pipe calculations
can correctly refer to snd_fack during and after CC events.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43957


# fcea1cc9 14-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: fix RTO ssthresh for non-6675 pipe calculation

Follow up on D43768 to properly deal with the non-default
pipe calculation. When CC_RTO is processed, the timeout
will have already pulled back snd_nxt. Further, snd_fack
is not pulled along with snd_una.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43876


# 3eeb22cb 10-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: clean scoreboard when releasing the socket buffer

The SACK scoreboard is conceptually an extention of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Verify that this is also the expected
state in tcp_discardcb().

PR: 276761
Reviewed by: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43805


# 0b3f9e43 27-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: move cc_post_recovery past snd_una update

The RFC6675 pipe calculation (sack.revised, enabled
by default since D28702), uses outdated information,
while the previous default calculated it correctly
with up-to-date information from the incoming ACK.

This difference can become as large as the receive
window (not the congestion window previously),
potentially triggering a massive burst of new packets.

MFC after: 1 week
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43520


# 2d05a1c8 25-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: commonize check for more data to send, style changes

Use SEQ_SUB instead of a plain subtraction, for an implict
type conversion and prevention of a possible overflow.
Use curly brackets in stacked if statements throughout.
Use of the ? operator to enhance readability when clearing
the FIN flag in tcp_output().

None of the above change the function.

Reviewed By: tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43539


# c7c325d0 24-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: pass maxseg around instead of calculating locally

Improve slowpath processing (reordering, retransmissions)
slightly by calculating maxseg only once. This typically
saves one of two calls to tcp_maxseg().

Reviewed By: glebius, tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43536


# 429f14f8 08-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: clean PRR state after ECN congestion recovery.

PRR state was not properly reset on subsequent ECN CE
events. Clean up after local transmission failures too.

Reviewed by: tuexen, cc, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43170


# f4574e2d 08-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: prevent spurious empty segments and fix uncommon panic

Only try sending more data on pure ACKs when there is
more data available in the send buffer.

In the case of a retransmitted SYN not being sent due to
an internal error, the snd_una/snd_nxt accounting could
be off, leading to a panic. Pulling snd_nxt up to snd_una
prevents this from happening.

Reported by: fengdreamer@126.com
Reviewed by: cc, tuexen, #transport
MFC after: 1 week
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43343


# 30409ecd 06-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: do not purge SACK scoreboard on first RTO

Keeping the SACK scoreboard intact after the first RTO
and retransmitting all data anew only on subsequent RTOs
allows a more timely and efficient loss recovery under
many adverse cirumstances.

Reviewed By: tuexen, #transport
MFC after: 10 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42906


# 893ed42e 06-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Make use of enum for sack_changed

No functional change.

Reviewed By: tuexen, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43346


# 513f2e2e 19-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: always set tcp_tun_port to a correct value

The tcp_tun_port field that is used to pass port value between UDP
and TCP in case of tunneling is a generic field that used to pass
data between network layers. It can be contaminated on entry, e.g.
by a VLAN tag set by a NIC driver. Explicily set it, so that it
is zeroed out in a normal not-tunneled TCP. If it contains garbage,
tcp_twcheck() later can enter wrong block of code and treat the packet
as incorrectly tunneled one. On main and stable/14 that will end up
with sending incorrect responses, but on stable/13 with ipfw(8) and
pcb-matching rules it may end up in a panic.

This is a minimal conservative patch to be merged to stable branches.
Later we may redesign this.

PR: 275169
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43065


# 9276ad23 07-Dec-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: shift PRR sending cadence slightly left

Don't let PRR pass up on the opportunity of clocking
out packets on arrival of ACKs - by pulling sends
forward by about half a packet. Prevents unexpectedly
long runs of incoming ACKs without eliciting a
packet transmission.

MFC after: 1 week
Reviewed By: #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42918


# f42518ff 30-Nov-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: for LRD move sysctl from tcp.do_lrd tp tcp.sack.lrd, remove sockopt

Moving lrd sysctl to the tcp.sack branch, since LRD only works with SACK.
Remove the sockopt to programmatically control LRD per session.

Reviewed By: #transport, tuexen, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42851


# 34c45bc6 29-Nov-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: enable LRD by default

Lost Retransmission Detection was added as a
feature in May 2021, but disabled by default.

Enabling the feature by default to reduce the
flow completion time by avoiding RTOs when
retransmissions get lost too.

Reviewed By: tuexen, #transport, zlei
MFC after: 10 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42845


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 49a6fbe3 15-Nov-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

[tcp] add PRR 6937bis heuristic and retire prr_conservative sysctl

Improve Proportional Rate Reduction (RFC6937) by using a
heuristic, which automatically chooses between
conservative CRB and more aggressive SSRB modes.
Only when snd_una advances (a partial ACK), SSRB may be
used. Also, that ACK must not have any indication of
ongoing loss - using the addition of new holes into the
scoreboard as proxy for such an event.

MFC after: 4 weeks
Reviewed By: #transport, kbowling, rrs
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28822


# e2c6a6d2 09-Oct-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: include RFC6675 IsLost() in pipe calculation

Add more accounting while processing SACK data, to
keep track of when a packet is deemed lost using
the RFC6675 guidance.

Together with PRR (RFC6972) this allows a sender to
retransmit presumed lost packets faster, and loss
recovery to complete earlier.

Reviewed By: cc, rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D39299


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# b352ef58 26-Jul-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Handle <RST,ACK> in SYN-RCVD

Patch base stack to correctly handle the RST bit independently
of other header flags per TCP RFC.

MFC after: 1 week
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D40982


# e5738ee0 11-May-2023 Cheng Cui <cc@FreeBSD.org>

Under RSS, assign a TCP flow's inp_flowid anyway.

Summary:
This brings some benefit of a tcp flow identification for some kernel
modules, such as siftr.

Reviewers: rrs, rscheff, tuexen, #transport!
Approved by: tuexen (mentor), rrs
Subscribers: imp, melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D40061


# 35bc0bcc 07-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: reduce argument list to functions that pass a segment

The socket argument is superfluous, as a tcpcb always has one and
only one socket.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39434


# 78e6c3aa 27-Mar-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: update error counter when dropping a packet due to bad source

Use the same counter that ip_input()/ip6_input() use for bad destination
address. For IPv6 this is already heavily abused ip6s_badscope, which
needs to be split into several separate error counters.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D39234


# 69c7c811 16-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.

The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831


# 713264f6 06-Mar-2023 Mark Johnston <markj@FreeBSD.org>

netinet: Tighten checks for unspecified source addresses

The assertions added in commit b0ccf53f2455 ("inpcb: Assert against
wildcard addrs in in_pcblookup_hash_locked()") revealed that protocol
layers may pass the unspecified address to in_pcblookup().

Add some checks to filter out such packets before we attempt an inpcb
lookup:
- Disallow the use of an unspecified source address in in_pcbladdr() and
in6_pcbladdr().
- Disallow IP packets with an unspecified destination address.
- Disallow TCP packets with an unspecified source address, and add an
assertion to verify the comment claiming that the case of an
unspecified destination address is handled by the IP layer.

Reported by: syzbot+9ca890fb84e984e82df2@syzkaller.appspotmail.com
Reported by: syzbot+ae873c71d3c71d5f41cb@syzkaller.appspotmail.com
Reported by: syzbot+e3e689aba1d442905067@syzkaller.appspotmail.com
Reviewed by: glebius, melifaro
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38570


# 18b83b62 26-Jan-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: reduce the size of t_rttupdated in tcpcb

During tcp session start, various mechanisms need to
track a few initial RTTs before becoming active.
Prevent overflows of the corresponding tracking counter
and reduce the size of tcpcb simultaneously.

Reviewed By: #transport, tuexen, guest-ccui
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21117


# aab8c844 05-Jan-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp/ipfw: fix "ipfw fwd localaddr,port"

The ipfw(4) feature of forwarding to local address without modifying
a packet was broken. The first lookup needs always be a non-wildcard
one, cause its goal is to find an already existing socket. Otherwise
a local wildcard listener with the same port number may match resulting
in the connection being forwared to wrong port.

Reported by: Pavel Polyakov <bsd kobyla.org>
Fixes: d88eb4654f372d0451139a1dbf525a8f2cad1cf8


# eaabc937 14-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: retire TCPDEBUG

This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.

We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket. Also the tcp::debug DTrace probes look at
this flag on a socket.

Reviewed by: gnn, tuexen
Discussed with: rscheff, rrs, jtl
Differential revision: https://reviews.freebsd.org/D37694


# e68b3792 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: embed inpcb into tcpcb

For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.

The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.

The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.

One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.

Large part of the change was done with sed script:

s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g

Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.

Differential revision: https://reviews.freebsd.org/D37127


# bd4f9866 16-Nov-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: remove unused t_rttbest

No functional change intended.

Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D37401


# 9eb0e832 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: provide macros to access inpcb and socket from a tcpcb

There should be no functional changes with this commit.

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37123


# f71cb9f7 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: inp_socket is valid through the lifetime of a TCP inpcb

The inp_socket is cleared only in in_pcbdetach(), which for TCP is
always accompanied with inp_pcbfree(). An inpcb that went through
in_pcbfree() shall never be returned by any kind of pcb lookup.

Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D37062


# f567d55f 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: don't return INP_DROPPED entries from pcb lookups

The in_pcbdrop() KPI, which is used solely by TCP, allows to remove a
pcb from hash list and mark it as dropped. The comment suggests that
such pcb won't be returned by lookups. Indeed, every call to
in_pcblookup*() is accompanied by a check for INP_DROPPED. Do what
comment suggests: never return such pcbs and remove unnecessary checks.

Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D37061


# 004bb636 05-Nov-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Move sysctl OIDs related to ECN to tcp_ecn.c

Keep all ECN related code in (mostly) one place.

No functional change.

Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37285


# b1258b76 06-Nov-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: add conservative d.cep accounting algorithm

Accurate ECN asks to conservatively estimate, when the
ACE counter may have wrapped due to a single ACK covering a larger
number of segments. This is described in Annex A.2 of the
accurate-ecn draft.

Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37281


# c348e880 31-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: make tcp_handle_wakeup() static and robust

It is called only from tcp_input() and always has valid parameter.

Reviewed by: rscheff, tuexen
Differential revision: https://reviews.freebsd.org/D37115


# eda63345 25-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove useless today lock assertion in a middle of function

It was added back in 7cfc6904408b, when there was a jump label
above and tcp_input() hadn't been locked all through.


# 83c1ec92 20-Oct-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: ECN preparations for ECN++, AccECN (tcp_respond)

tcp_respond is another function to build a tcp control packet
quickly. With ECN++ and AccECN, both the IP ECN header, and
the TCP ECN flags are supposed to reflect the correct state.

Also ensure that on receiving multiple ECN SYN-ACKs, the
responses triggered will reflect the latest state.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36973


# 0d744519 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove tcptw, the compressed timewait state structure

The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no
longer justify the complexity required to maintain it. For longer
explanation please check out the email [1].

Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot. The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].

This change just removes tcptw and leaves INP_TIMEWAIT. The flag will
be removed in a separate commit. This makes it easier to review and
possibly debug the changes.

[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite

Differential revision: https://reviews.freebsd.org/D36398


# cd84e78f 04-Oct-2022 Randall Stewart <rrs@FreeBSD.org>

tcp idle reduce does not work for a server.

TCP has an idle-reduce feature that allows a connection to reduce its
cwnd after it has been idle more than an RTT. This feature only works
for a sending side connection. It does this by at output checking the
idle time (t_rcvtime vs ticks) to see if its more than the RTO timeout.

The problem comes if you are a web server. You get a request and
then send out all the data.. then go idle. The next time you would
send is in response to a request from the peer asking for more data.
But the thing is you updated t_rcvtime when the request came in so
you never reduce.

The fix is to do the idle reduce check also on inbound.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36721


# 08af8aac 27-Sep-2022 Randall Stewart <rrs@FreeBSD.org>

Tcp progress timeout

Rack has had the ability to timeout connections that just sit idle automatically. This
feature of course is off by default and requires the user set it on (though the socket option
has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in
the base stack as well as rack.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36716


# d1b07f36 26-Sep-2022 Randall Stewart <rrs@FreeBSD.org>

TCP complete end status work.

The ending of a connection can tell us a lot about what happened i.e. did
it fail to setup, did it timeout, was it a normal close. Often times this is
useful information to help analyze and debug issues. Rack has had
end status for some time but the base stack as not. Lets go a ahead
and add in the missing bits to populate the end status.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36712


# e5049a17 26-Sep-2022 Randall Stewart <rrs@FreeBSD.org>

TCP rack does not work properly with cubic.

Right now if you use rack with cubic (the new default cc) you will have
improper results. This is because rack uses different variables than
the base stack (or bbr) and thus tcp_compute_pipe() always returns
so that cubic will choose a 30% backoff not the 50% backoff it should
when it is newreno compatibility mode. The fix is to allow a stack (rack)
to override its own compute_pipe.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36711


# 5ae83e0d 21-Sep-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: send ACKs when requested

When doing Limited Transmit send an ACK when needed by the protocol
processing (like sending ACKs with a DSACK block).

PR: 264257
PR: 263445
PR: 260393
Reviewed by: rscheff@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D36631


# 493105c2 21-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: fix simultaneous open and refine e80062a2d43

- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
is also necessary for a half-synchronized connection. Fix that
just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.

Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet. For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete. The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup(). Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case. However, this doesn't yet work.

Reviewed by: rscheff, tuexen, rrs
Differential revision: https://reviews.freebsd.org/D36641


# e80062a2 08-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: avoid call to soisconnected() on transition to ESTABLISHED

This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place. After 6f3caa6d815 it definitely
became necessary always and commit message from f1ee30ccd60 confirms that.
Now that 6f3caa6d815 is effectively backed out by 07285bb4c22, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.

Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36488


# c21b7b55 31-Aug-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: finish SACK loss recovery on sudden lack of SACK blocks

While a receiver should continue sending SACK blocks for the
duration of a SACK loss recovery, if for some reason the
TCP options no longer contain these SACK blocks, but we
already started maintaining the Scoreboard, keep on handling
incoming ACKs (without SACK) as belonging to the SACK recovery.

Reported by: thj
Reviewed by: tuexen, #transport
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36046


# c0060575 29-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove a dead code leftover from T/TCP,

that doesn't have any value today.


# d9f6ac88 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: retire PRU_ flags and their char names

For many years only TCP debugging used them, but relatively recently
TCP DTrace probes also start to use them. Move their declarations
into tcp_debug.h, but start including tcp_debug.h unconditionally,
so that compilation with DTrace and without TCPDEBUG is possible.


# d88eb465 10-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: address a wire level race with 2 ACKs at the end of TCP handshake

Imagine we are in SYN-RCVD state and two ACKs arrive at the same time,
both valid, e.g. coming from the same host and with valid sequence.

First packet would locate the listening socket in the inpcb database,
write-lock it and start expanding the syncache entry into a socket.
Meanwhile second packet would wait on the write lock of the listening
socket. First packet will create a new ESTABLISHED socket, free the
syncache entry and unlock the listening socket. Second packet would
call into syncache_expand(), but this time it will fail as there
is no syncache entry. Second packet would generate RST, effectively
resetting the remote connection.

It seems to me, that it is impossible to solve this problem with
just rearranging locks, as the race happens at a wire level.

To solve the problem, for an ACK packet arrived on a listening socket,
that failed syncache lookup, perform a second non-wildcard lookup right
away. That lookup may find the new born socket. Otherwise, we indeed
send RST.

Tested by: kp
Reviewed by: tuexen, rrs
PR: 265154
Differential revision: https://reviews.freebsd.org/D36066


# e7231d07 07-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_input: update comment to match reality.


# 74703901 04-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use a TCP flag to check if connection has been close(2)d

The flag SS_NOFDREF is a private flag of the socket layer. It also
is supposed to be read with SOCK_LOCK(), which we don't own here.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D35663


# ad3ad064 27-Jun-2022 Gleb Smirnoff <glebius@FreeBSD.org>

blackhole(4): fix operator precedence

Fixes: 3ea9a7cf7b09a355cde3a76824809402b99d0892


# f5766992 15-Jun-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

tcp: Correctly compute the TCP goodput in bits per second by using SEQ_SUB().

TCP sequence number differences should be computed using SEQ_SUB().

Differential Revision: https://reviews.freebsd.org/D35505
Reviewed by: rscheff@
MFC after: 1 week
Sponsored by: NVIDIA Networking


# 43283184 12-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: use socket buffer mutexes in struct socket directly

Since c67f3b8b78e the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it. In 74a68313b50 macros that access
this mutex directly were added. Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally. However, it allows operation of a
protocol that doesn't use them.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35152


# f7220c48 05-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: move ECN handling code to a common file

Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.

Formally no functional change.

Incidentially this establishes correct
ECN operation in one instance.

Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162


# 7994ef3c 04-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

Revert "tcp: move ECN handling code to a common file"

This reverts commit 0c424c90eaa6602e07bca7836b1d178b91f2a88a.


# 0c424c90 04-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: move ECN handling code to a common file

Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.

Formally no functional change.

Incidentially this establishes correct
ECN operation in one instance.

Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162


# 1ebf4607 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Access all 12 TCP header flags via inline function

In order to consistently provide access to all
(including reserved) TCP header flag bits,
use an accessor function tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.

Reviewed By: hselasky, tuexen, glebius, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34130


# 4531b345 27-Jan-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Tidying up the conditionals for unwinding a spurious RTO

- Use the semantically correct TSTMP_xx macro when comparing
timestamps. (No functional change)
- check for bad retransmits only when TSopt is present in ACK
(don't assume there will be a valid TSopt in the TCP options struct)
- exclude tsecr == 0, since that most likely indicates an
invalid ts echo return (tsecr) value.

Reviewed By: tuexen, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34062


# 68e623c3 27-Jan-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Rewind erraneous RTO only while performing RTO retransmissions

Under rare circumstances, a spurious retranmission is
incorrectly detected and rewound, messing up various tcpcb values,
which can lead to a panic when SACK is in use.

Reviewed By: tuexen, chengc_netapp.com, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D33979


# 40fa3e40 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: mechanically substitute call to tfb_tcp_output to new method.

Made with sed(1) execution:

sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/)

sed:
s/tp->t_fb->tfb_tcp_output\(tp\)/tcp_output(tp)/
s/to tfb_tcp_output\(\)/to tcp_output()/

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33366


# 75add59a 17-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: allocate statistics in the main tcp_init()

No reason to have a separate SYSINIT.


# db0ac6de 02-Dec-2021 Cy Schubert <cy@FreeBSD.org>

Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"

This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing
changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.


# de2d4784 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

SMR protection for inpcbs

With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc). However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.

Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree(). However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it. Details:

- The database is already build on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
Once the desired inpcb is found we need to transition from SMR
section to r/w lock on the inpcb itself, with a check that inpcb
isn't yet freed. This requires some compexity, since SMR section
itself is a critical(9) section. The complexity is hidden from
KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
notification) also a new KPI is provided, that hides internals of
the database - inp_next(struct inp_iterator *).

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33022


# 3ea9a7cf 28-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

blackhole(4): disable for locally originated TCP/UDP packets

In most cases blackholing for locally originated packets is undesired,
leads to different kind of lags and delays. Provide sysctls to enforce
it, e.g. for debugging purposes.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D32718


# 74d7fc87 19-Jun-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Add PRR cwnd reduction for non-SACK loss

This completes PRR cwnd reduction in all circumstances
for the base TCP stack (SACK loss recovery, ECN window reduction,
non-SACK loss recovery), preventing the arriving ACKs to
clock out new data at the old, too high rate. This
reduces the chance to induce additional losses while
recovering from loss (during congested network conditions).

For non-SACK loss recovery, each ACK is assumed to have
one MSS delivered. In order to prevent ACK-split attacks,
only one window worth of ACKs is considered to actually
have delivered new data.

MFC after: 6 weeks
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29441


# f4bb1869 14-Jun-2021 Mark Johnston <markj@FreeBSD.org>

Consistently use the SOLISTENING() macro

Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly. As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING(). No functional change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 4747500d 04-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: A better fix for the previously attempted fix of the ack-war issue with tcp.

So it turns out that my fix before was not correct. It ended with us failing
some of the "improved" SYN tests, since we are not in the correct states.
With more digging I have figured out the root of the problem is that when
we receive a SYN|FIN the reassembly code made it so we create a segq entry
to hold the FIN. In the established state where we were not in order this
would be correct i.e. a 0 len with a FIN would need to be accepted. But
if you are in a front state we need to strip the FIN so we correctly handle
the ACK but ignore the FIN. This gets us into the proper states
and avoids the previous ack war.

I back out some of the previous changes but then add a new change
here in tcp_reass() that fixes the root cause of the issue. We still
leave the rack panic fixes in place however.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30627


# 8c69d988 27-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: When we have an out-of-order FIN we do want to strip off the FIN bit.

The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr).
However there was a missing case, i.e. where we get an out-of-order FIN by itself.
In such a case we don't want to leave the FIN bit set, otherwise we will do the
wrong thing and ack the FIN incorrectly. Instead we need to go through the
tcp_reasm() code and that way the FIN will be stripped and all will be well.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30497


# 13c0e198 25-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Fix bugs related to the PUSH bit and rack and an ack war

Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed
in the last commit. The problem is the left edge gets transmitted before the adjustments are done
to the send_map, this means that right edge bits must be considered to be added only if
the entire RSM is being retransmitted.

Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns
out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself.
After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically
what happens is we go into the reassembly code and lose the FIN bit. The trick here is we
should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything.
That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no
timers running. This is because the usrclosed function gets called and the FIN's and such have
already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get
stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly
recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE
before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2
timer can speed this up in testing.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30451


# 032bf749 21-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

[tcp] Keep socket buffer locked until upcall

r367492 would unlock the socket buffer before eventually calling the upcall.
This leads to problematic interaction with NFS kernel server/client components
(MP threads) accessing the socket buffer with potentially not correctly updated
state.

Reported by: rmacklem
Reviewed By: tuexen, #transport
Tested by: rmacklem, otis
MFC after: 2 weeks
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29690


# 0471a8c7 10-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: SACK Lost Retransmission Detection (LRD)

Recover from excessive losses without reverting to a
retransmission timeout (RTO). Disabled by default, enable
with sysctl net.inet.tcp.do_lrd=1

Reviewed By: #transport, rrs, tuexen, #manpages
Sponsored by: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D28931


# 5d8fd932 06-May-2021 Randall Stewart <rrs@FreeBSD.org>

This brings into sync FreeBSD with the netflix versions of rack and bbr.
This fixes several breakages (panics) since the tcp_lro code was
committed that have been reported. Quite a few new features are
now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the
largest). There is also support for ack-war prevention. Documents
comming soon on rack..

Sponsored by: Netflix
Reviewed by: rscheff, mtuexen
Differential Revision: https://reviews.freebsd.org/D30036


# 48be5b97 28-Apr-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: stop spurious rescue retransmissions and potential asserts

Reported by: pho@
MFC after: 3 days
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29970


# 1db08fbe 16-Apr-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_input: always request read-locking of PCB for any pure SYN segment.

This is further rework of 08d9c920275. Now we carry the knowledge of
lock type all the way through tcp_input() and also into tcp_twcheck().
Ideally the rlocking for pure SYNs should propagate all the way into
the alternative TCP stacks, but not yet today.

This should close a race when socket is bind(2)-ed but not yet
listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered
recently by pho@.


# 7b5053ce 16-Apr-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_input: remove comments and assertions about tcpbinfo locking

They aren't valid since d40c0d47cd2.


# 9e644c23 18-Apr-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add support for TCP over UDP

Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469


# d1de2b05 17-Apr-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Rename rfc6675_pipe to sack.revised, and enable by default

As full support of RFC6675 is in place, deprecating
net.inet.tcp.rfc6675_pipe and enabling by default
net.inet.tcp.sack.revised.

Reviewed By: #transport, kbowling, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28702


# 8d5719aa 18-Mar-2021 Gleb Smirnoff <glebius@FreeBSD.org>

syncache: simplify syncache_add() KPI to return struct socket pointer
directly, not overwriting the listen socket pointer argument.
Not a functional change.


# 08d9c920 18-Mar-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets

When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.

Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D29576


# 90cca08e 08-Apr-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Prepare PRR to work with NewReno LossRecovery

Add proper PRR vnet declarations for consistency.
Also add pointer to tcpopt struct to tcp_do_prr_ack, in preparation
for it to deal with non-SACK window reduction (after loss).

No functional change.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29440


# b9f803b7 25-Mar-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Use PRR for ECN congestion recovery

MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28972


# eb3a59a8 25-Mar-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Refactor PRR code

No functional change intended.

MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29411


# 0533fab8 25-Mar-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Perform simple fast retransmit when SACK Blocks are missing on SACK session

MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28634


# 40f41ece 22-Mar-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve handling of SYN segments in SYN-SENT state

Ensure that the stack does not generate a DSACK block for user
data received on a SYN segment in SYN-SENT state.

Reviewed by: rscheff
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D29376
Sponsored by: Netflix, Inc.


# e5313869 05-Mar-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Add prr_out in preparation for PRR/nonSACK and LRD

Reviewed By: #transport, kbowling
MFC after: 3 days
Sponsored By: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D29058


# 4a8f3aad 05-Mar-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: remove incorrect reset of SACK variable in PRR

Reviewed By: #transport, rrs, tuexen
PR: 253848
MFC after: 3 days
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29083


# bb4a7d94 04-Mar-2021 Kristof Provost <kp@FreeBSD.org>

net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros

Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.

Use them where appropriate.

Reviewed by: ae (previous version), rscheff, tuexen, rgrimes
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29056


# 0b0f8b35 01-Mar-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

calculate prr_out correctly when pipe < ssthresh

Reviewed By: #transport, tuexen
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28998


# e9071000 28-Feb-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

Improve PRR initial transmission timing

Reviewed By: tuexen, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28953


# 9e83a6a5 26-Feb-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

Include new data sent in PRR calculation

Reviewed By: #transport, kbowling
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28941


# 2593f858 25-Feb-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

A TCP server has to take into consideration, if TCP_NOOPT is preventing
the negotiation of TCP features. This affects most TCP options but
adherance to RFC7323 with the timestamp option will prevent a session
from getting established.

PR: 253576
Reviewed By: tuexen, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28652


# 31d7a27c 25-Feb-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

PRR: Avoid accounting left-edge twice in partial ACK.

Reviewed By: #transport, kbowling
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28819


# 48396dc7 25-Feb-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

Address two incorrect calculations and enhance readability of PRR code

- address second instance of cwnd potentially becoming zero
- fix sublte bug due to implicit int to uint typecase in max()
- fix bug due to typo in hand-coded CEILING() function by using howmany() macro
- use int instead of long, and add a missing long typecast
- replace if conditionals with easier to read imax/imin (as in pseudocode)

Reviewed By: #transport, kbowling
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28813


# a8e431e1 20-Feb-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

PRR: use accurate rfc6675_pipe when enabled

Reviewed By: #transport, tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28816


# 853fd7a2 19-Feb-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

Ensure cwnd doesn't shrink to zero with PRR

Under some circumstances, PRR may end up with a fully
collapsed cwnd when finalizing the loss recovery.

Reviewed By: #transport, kbowling
Reported by: Liang Tian
MFC after: 1 week
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28780


# 3c40e1d5 15-Feb-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

update the SACK loss recovery to RFC6675, with the following new features:
- improved pipe calculation which does not degrade under heavy loss
- engaging in Loss Recovery earlier under adverse conditions
- Rescue Retransmission in case some of the trailing packets of a request got lost

All above changes are toggled with the sysctl "rfc6675_pipe" (disabled by default).

Reviewers: #transport, tuexen, lstewart, slavash, jtl, hselasky, kib, rgrimes, chengc_netapp.com, thj, #manpages, kbowling, #netapp, rscheff
Reviewed By: #transport
Subscribers: imp, melifaro
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D18985


# 8268d82c 15-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove per-packet ifa refcounting from IPv6 fast path.

Currently ip6_input() calls in6ifa_ifwithaddr() for
every local packet, in order to check if the target ip
belongs to the local ifa in proper state and increase
its counters.

in6ifa_ifwithaddr() references found ifa.
With epoch changes, both `ip6_input()` and all other current callers
of `in6ifa_ifwithaddr()` do not need this reference
anymore, as epoch provides stability guarantee.

Given that, update `in6ifa_ifwithaddr()` to allow
it to return ifa without referencing it, while preserving
option for getting referenced ifa if so desired.

MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28648


# 6a376af0 26-Jan-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

TCP PRR: Patch div/0 in tcp_prr_partialack

With clearing of recover_fs in bc7ee8e5bc555, div/0
was observed while processing partial_acks.

Suspect that rewind of an erraneous RTO may be
causing this - with the above change, recover_fs
would no longer retained at the last calculated
value, and reset. But CC_RTO_ERR can reenable
IN_RECOVERY(), without setting this again.

Adding a safety net prior to the division in that
function, which I missed in D28114.


# 84761f3d 26-Jan-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

Adjust line length in tcp_prr_partialack

Summary:
Wrap lines before column 80 in new prr code checked in recently.

No functional changes.

Reviewers: tuexen, rrs, jtl, mm, kbowling, #transport

Reviewed By: tuexen, mm, #transport

Subscribers: imp, melifaro

Differential Revision: https://reviews.freebsd.org/D28329


# bc7ee8e5 19-Jan-2021 Richard Scheffenegger <srichard@netapp.com>

Address panic with PRR due to missed initialization of recover_fs

Summary:
When using the base stack in conjunction with RACK, it appears that
infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh.

This leaves the recover flightsize (sackhint.recover_fs) uninitialized,
leading to a div/0 panic.

Address this by properly initializing the variable just prior to first
use, if it is not properly initialized.

In order to prevent stale information from a prior recovery to
negatively impact the PRR calculations in this event, also clear
recover_fs once loss recovery is finished.

Finally, improve the readability of the initialization of recover_fs
when t_dupacks == tcprexmtthresh by adjusting the indentation and
using the max(1, snd_nxt - snd_una) macro.

Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages

Reviewed By: rrs, kbowling, #transport

Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro

Differential Revision: https://reviews.freebsd.org/D28114


# d2b3cedd 13-Jan-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add sysctl to tolerate TCP segments missing timestamps

When timestamp support has been negotiated, TCP segements received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models), which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows to interoperate with broken stacks.

Reviewed by: jtl@, rscheff@
Differential Revision: https://reviews.freebsd.org/D28142
Sponsored by: Netflix, Inc.
PR: 252449
MFC after: 1 week


# cc3c3485 13-Jan-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix handling of TCP RST segments missing timestamps

A TCP RST segment should be processed even it is missing TCP
timestamps.

Reported by: dmgk@, kevans@
Reviewed by: rscheff@, dmgk@
Sponsored by: Netflix, Inc.
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28143


# 0e1d7c25 04-Dec-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Add TCP feature Proportional Rate Reduction (PRR) - RFC6937

PRR improves loss recovery and avoids RTOs in a wide range
of scenarios (ACK thinning) over regular SACK loss recovery.

PRR is disabled by default, enable by net.inet.tcp.do_prr = 1.
Performance may be impeded by token bucket rate policers at
the bottleneck, where net.inet.tcp.do_prr_conservate = 1
should be enabled in addition.

Submitted by: Aris Angelogiannopoulos
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D18892


# 75fcd27a 23-Nov-2020 Michael Tuexen <tuexen@FreeBSD.org>

Fix two occurences of a typo in a comment introduced in r367530.

Reported by: lstewart@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27148


# 283c76c7 09-Nov-2020 Michael Tuexen <tuexen@FreeBSD.org>

RFC 7323 specifies that:
* TCP segments without timestamps should be dropped when support for
the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
for the timestamp option has not been negotiated.
This patch enforces the above.

PR: 250499
Reviewed by: gnn, rrs
MFC after: 1 week
Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D27148


# 4d0770f1 08-Nov-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Prevent premature SACK block transmission during loss recovery

Under specific conditions, a window update can be sent with
outdated SACK information. Some clients react to this by
subsequently delaying loss recovery, making TCP perform very
poorly.

Reported by: chengc_netapp.com
Reviewed by: rrs, jtl
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24237


# 39a12f01 24-Oct-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: move cwnd and ssthresh updates into cc modules

This will pave the way of setting ssthresh differently in TCP CUBIC, according
to RFC8312 section 4.7.

No functional change, only code movement.

Submitted by: chengc_netapp.com
Reviewed by: rrs, tuexen, rscheff
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26807


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 1951fa79 25-Aug-2020 Michael Tuexen <tuexen@FreeBSD.org>

RFC 3465 defines a limit L used in TCP slow start for limiting the number
of acked bytes as described in Section 2.2 of that document.
This patch ensures that this limit is not also applied in congestion
avoidance. Applying this limit also in congestion avoidance can result in
using less bandwidth than allowed.

Reported by: l.tian.email@gmail.com
Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D26120


# f359d6eb 13-Aug-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Improve SACK support code for RFC6675 and PRR

Adding proper accounting of sacked_bytes and (per-ACK)
delivered data to the SACK scoreboard. This will
allow more aspects of RFC6675 to be implemented as well
as Proportional Rate Reduction (RFC6937).

Prior to this change, the pipe calculation controlled with
net.inet.tcp.rfc6675_pipe was also susceptible to incorrect
results when more than 3 (or 4) holes in the sequence space
were present, which can no longer all fit into a single
ACK's SACK option.

Reviewed by: kbowling, rgrimes (mentor)
Approved by: rgrimes (mentor, blanket)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D18624


# e854dd38 08-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

An important statistic in determining if a server process (or client) is being delayed
is to know the time to first byte in and time to first byte out. Currently we
have no way to know these all we have is t_starttime. That (t_starttime) tells us
what time the 3 way handshake completed. We don't know when the first
request came in or how quickly we responded. Nor from a client perspective
do we know how long from when we sent out the first byte before the
server responded.

This small change adds the ability to track the TTFB's. This will show up in
BB logging which then can be pulled for later analysis. Note that currently
the tracking is via the ticks variable of all three variables. This provides
a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be
to change all three of these values into something with a much finer resolution
(either microseconds or nanoseconds), though we may want to make the resolution
configurable so that on lower powered machines we could still use the much
cheaper ticks variable.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24902


# f1ea4e41 03-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

This fixes a couple of skyzaller crashes. Most
of them have to do with TFO. Even the default stack
had one of the issues:

1) We need to make sure for rack that we don't advance
snd_nxt beyond iss when we are not doing fast open. We
otherwise can get a bunch of SYN's sent out incorrectly
with the seq number advancing.
2) When we complete the 3-way handshake we should not ever
append to reassembly if the tlen is 0, if TFO is enabled
prior to this fix we could still call the reasemmbly. Note
this effects all three stacks.
3) Rack like its cousin BBR should track if a SYN is on a
send map entry.
4) Both bbr and rack need to only consider len incremented on a SYN
if the starting seq is iss, otherwise we don't increment len which
may mean we return without adding a sendmap entry.

This work was done in collaberation with Michael Tuexen, thanks for
all the testing!
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D25000


# af2fb894 21-May-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

With RFC3168 ECN, CWR SHOULD only be sent with new data

Overly conservative data receivers may ignore the CWR flag
on other packets, and keep ECE latched. This can result in
continous reduction of the congestion window, and very poor
performance when ECN is enabled.

Reviewed by: rgrimes (mentor), rrs
Approved by: rgrimes (mentor), tuexen (mentor)
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23364


# 8e051165 21-May-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Retain only mutually supported TCP options after simultaneous SYN

When receiving a parallel SYN in SYN-SENT state, remove all the
options only we supported locally before sending the SYN,ACK.

This addresses a consistency issue on parallel opens.

Also, on such a parallel open, the stack could be coaxed into
running with timestamps enabled, even if administratively disabled.

Reviewed by: tuexen (mentor)
Approved by: tuexen (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23371


# 6e16d877 21-May-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Handle ECN handshake in simultaneous open

While testing simultaneous open TCP with ECN, found that
negotiation fails to arrive at the expected final state.

Reviewed by: tuexen (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23373


# b2ade6b1 29-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Correctly set up the initial TCP congestion window in all cases,
by not including the SYN bit sequence space in cwnd related calculations.
Snd_und is adjusted explicitly in all cases, outside the cwnd update, instead.

This fixes an off-by-one conformance issue with regular TCP sessions not
using Appropriate Byte Counting (RFC3465), sending one more packet during
the initial window than expected.

PR: 235256
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D19000


# bb410f9f 21-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

revert rS360143 - Correctly set up initial cwnd
due to syzkaller panics found

Reported by: tuexen
Approved by: tuexen (mentor)
Sponsored by: NetApp, Inc.


# 73b76966 21-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Correctly set up the initial TCP congestion window
in all cases, by adjust snd_una right after the
connection initialization, to include the one byte
in sequence space occupied by the SYN bit.

This does not change the regular ACK processing,
while making the BYTES_THIS_ACK macro to work properly.

PR: 235256
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D19000


# 7ca6e296 12-Mar-2020 Michael Tuexen <tuexen@FreeBSD.org>

Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since
these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that
instead of TCPSTAT_ADD.

Reviewed by: jtl@, rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D23904


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# a3574665 13-Feb-2020 Michael Tuexen <tuexen@FreeBSD.org>

sack_newdata and snd_recover hold the same value. Therefore, use only
a single instance: use snd_recover also where sack_newdata was used.

Submitted by: Richard Scheffenegger
Differential Revision: https://reviews.freebsd.org/D18811


# 481be5de 12-Feb-2020 Randall Stewart <rrs@FreeBSD.org>

White space cleanup -- remove trailing tab's or spaces
from any line.

Sponsored by: Netflix Inc.


# 9cc711c9 25-Jan-2020 Michael Tuexen <tuexen@FreeBSD.org>

Sending CWR after an RTO is according to RFC 3168 generally required
and not only for the DCTCP congestion control.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes, tuexen@, Cheng Cui
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23119


# a2d59694 25-Jan-2020 Michael Tuexen <tuexen@FreeBSD.org>

As a TCP client only enable ECN when the corresponding sysctl variable
indicates that ECN should be negotiated for the client side.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23228


# ee97681e 24-Jan-2020 Michael Tuexen <tuexen@FreeBSD.org>

Don't delay the ACK for a TCP segment with the CWR flag set.
This allows the data sender to increase the CWND faster.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, tuexen@, Cheng Cui
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22670


# 8f63a52b 24-Jan-2020 Michael Tuexen <tuexen@FreeBSD.org>

The server side of TCP fast open relies on the delayed ACK timer to allow
including user data in the SYN-ACK. When DSACK support was added in
r347382, an immediate ACK was sent even for the received SYN with
user data. This patch fixes that and allows again to send user data with
the SYN-ACK.

Reported by: Jeremy Harris
Reviewed by: Richard Scheffenegger, rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D23212


# 334fc582 08-Jan-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

vnet: virtualise more network stack sysctls.

Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are
set in the netoptions startup script, which we would love to run for VNETs
as well [1].

While virtualising the log_in_vain sysctls seems pointles at first for as
long as the kernel message buffer is not virtualised, it at least allows
an administrator to debug the base system or an individual jail if needed
without turning the logging on for all jails running on a system.

PR: 243193 [1]
MFC after: 2 weeks


# 4ad24737 06-Jan-2020 Randall Stewart <rrs@FreeBSD.org>

This catches rack up in the recent changes to ECN and
also commonizes the functions that both the freebsd and
rack stack uses.

Sponsored by:Netflix Inc
Differential Revision: https://reviews.freebsd.org/D23052


# e11c9783 31-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix delayed ACK generation for DCTCP.

Submitted by: Richard Scheffenegger
Reviewed by: chengc@netapp.com, rgrimes@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22644


# 83a2839f 31-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

Clear the flag indicating that the last received packet was marked CE also
in the case where a packet not marked was received.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19143


# adc56f5a 02-Dec-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make use of the stats(3) framework in the TCP stack.

This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time. stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by: thj
Tested by: thj
Obtained from: Netflix
Relnotes: yes
Sponsored by: Klara Inc, Netflix
Differential Revision: https://reviews.freebsd.org/D20655


# 3cf38784 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

Move all ECN related flags from the flags to the flags2 field.
This allows adding more ECN related flags in the future.
No functional change intended.

Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22497


# b72e56e7 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

This is an initial step in implementing the new congestion window
validation as specified in RFC 7661.

Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D21798


# fa49a964 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

In order for the TCP Handshake to support ECN++, and further ECN-related
improvements, the ECN bits need to be exposed to the TCP SYNcache.
This change is a minimal modification to the function headers, without any
functional change intended.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22436


# a4adf6cc 30-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros.

r354748-354750 replaced the KAME macros with m_pulldown() calls.
Contrary to the rest of the network stack m_len checks before m_pulldown()
were not put in placed (see r354748).
Put these m_len checks in place for now (to go along with the style of the
network stack since the initial commits). These are not put in for
performance but to avoid an error scenario (even though it also will help
performance at the moment as it avoid allocating an extra mbuf; not because
of the unconditional function call).

The observed error case went like this:
(1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it.
(2) m_pullup() will call m_get() unless the requested length is larger than
MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the
requested length of data and pkthdr into the new mbuf.
(3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail.
This was observed with failing auto-configuration as an RA packet of
200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input()
dropped the mbuf.
(Re-)adding the m_len checks before m_pullup() calls avoids this problems
with mbufs using external storage for now.

MFC after: 3 weeks
Sponsored by: Netflix


# 4e619b17 15-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

IP6_EXTHDR_CHECK(): remove the last instances

While r354748 removed almost all IP6_EXTHDR_CHECK() calls, these
are not part of the PULLDOWN_TESTS.
Equally convert these IP6_EXTHDR_CHECK()s here to m_pullup() and remove
the extra check and m_pullup() in tcp_input() under isipv6 given
tcp6_input() has done exactly that pullup already.

MFC after: 8 weeks
Sponsored by: Netflix


# a8fe77d8 12-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

netinet*: update *mp to pass the proper value back

In ip6_[direct_]input() we are looping over the extension headers
to deal with the next header. We pass a pointer to an mbuf pointer
to the handling functions. In certain cases the mbuf can be updated
there and we need to pass the new one back. That missing in
dest6_input() and route6_input(). In tcp6_input() we should also
update it before we call tcp_input().

In addition to that mark the mbuf NULL all the times when we return
that we are done with handling the packet and no next header should
be checked (IPPROTO_DONE). This will eventually allow us to assert
proper behaviour and catch the above kind of errors more easily,
expecting *mp to always be set.

This change is extracted from a larger patch and not an exhaustive
change across the entire stack yet.

PR: 240135
Reported by: prabhakar.lakhera gmail.com
MFC after: 3 weeks
Sponsored by: Netflix


# d40c0d47 07-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Now that all of the tcp_input() and all its branches are executed
in the network epoch, we can greatly simplify synchronization.
Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro.
Remove some unneccesary assertions and convert necessary ones into the
NET_EPOCH_ASSERT macro.


# 503f4e47 07-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

netinet*: variable cleanup

In preparation for another change factor out various variable cleanups.
These mainly include:
(1) do not assign values to variables during declaration: this makes
the code more readable and does allow for better grouping of
variable declarations,
(2) do not assign values to variables before need; e.g., if a variable
is only used in the 2nd half of a function and we have multiple
return paths before that, then do not set it before it is needed, and
(3) try to avoid assigning the same value multiple times.

MFC after: 3 weeks
Sponsored by: Netflix


# 6d261981 09-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

Only update SACK/DSACK lists when a non-empty segment was received.
This fixes hitting a KASSERT with a valid packet exchange.

Reviewed by: rrs@, Richard Scheffenegger
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21567


# ecc5b1d1 03-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix the SACK block generation in the base TCP stack by bringing it in
sync with the RACK stack.

Reviewed by: rrs@
MFC after: 5 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21513


# fe5dee73 02-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

This patch improves the DSACK handling to conform with RFC 2883.
The lowest SACK block is used when multiple Blocks would be elegible as
DSACK blocks ACK blocks get reordered - while maintaining the ordering of
SACK blocks not relevant in the DSACK context is maintained.

Reviewed by: rrs@, tuexen@
Obtained from: Richard Scheffenegger
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21038


# c4556b2f 12-Aug-2019 Andrey V. Elsukov <ae@FreeBSD.org>

Save ip_ttl value and restore it after checksum calculation.

Since ipvoly is used for checksum calculation, part of original IP
header is zeroed. This part includes ip_ttl field, that can be used
later in IP_MINTTL socket option handling.

PR: 239799
MFC after: 1 week


# b5a154d8 09-May-2019 Michael Tuexen <tuexen@FreeBSD.org>

Don't use C++ style comments.

These where introduced in r347382.
Reported by: ngie@


# 5acfd95c 09-May-2019 Michael Tuexen <tuexen@FreeBSD.org>

Receiver side DSACK implemenation.

This adds initial support for RFC 2883.

Submitted by: Richard Scheffenegger
Reviewed by: rrs@
Differential Revision: https://reviews.freebsd.org/D19334


# 560c0586 21-Feb-2019 Michael Tuexen <tuexen@FreeBSD.org>

The receive buffer autoscaling for TCP is based on a linear growth, which
is acceptable in the congestion avoidance phase, but not during slow start.
The MTU is is also not taken into account.
Use a method instead, which is based on exponential growth working also in
slow start and being independent from the MTU.

This is joint work with rrs@.

Reviewed by: rrs@, Richard Scheffenegger
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18375


# 116ef4d6 31-Jan-2019 Michael Tuexen <tuexen@FreeBSD.org>

When handling SYN-ACK segments in the SYN-RCVD state, set tp->snd_wnd
consistently.

This inconsistency was observed when working on the bug reported in
PR 235256, although it does not fix the reported issue. The fix for
the PR will be a separate commit.

PR: 235256
Reviewed by: rrs@, Richard Scheffenegger
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D19033


# bf7fcdb1 27-Jan-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix the detection of ECN-setup SYN-ACK packets.

RFC 3168 defines an ECN-setup SYN-ACK packet as on with the ECE flags
set and the CWR flags not set. The code was only checking if ECE flag
is set. This patch adds the check to verify that the CWR flags is not
set.

Submitted by: Richard Scheffenegger
Reviewed by: tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18996


# 7dc90a1d 25-Jan-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix a bug in the restart window computation of TCP New Reno

When implementing support for IW10, an update in the computation
of the restart window used after an idle phase was missed. To
minimize code duplication, implement the logic in tcp_compute_initwnd()
and call it. This fixes a bug in NewReno, which was not aware of
IW10.

Submitted by: Richard Scheffenegger
Reviewed by: tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18940


# 93899d10 18-Oct-2018 Michael Tuexen <tuexen@FreeBSD.org>

The handling of RST segments in the SYN-RCVD state exists in the
code paths. Both are not consistent and the one on the syn cache code
does not conform to the relevant specifications (Page 69 of RFC 793
and Section 4.2 of RFC 5961).

This patch fixes this:
* The sequence numbers checks are fixed as specified on
page Page 69 RFC 793.
* The sysctl variable net.inet.tcp.insecure_rst is now honoured
and the behaviour as specified in Section 4.2 of RFC 5961.

Approved by: re (gjb@)
Reviewed by: bz@, glebius@, rrs@,
Differential Revision: https://reviews.freebsd.org/D17595
Sponsored by: Netflix, Inc.


# 384a5c3c 01-Oct-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Add INP_INFO_WUNLOCK_ASSERT() macro and use it instead of
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible, that the code is running in net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is very strict assertion for such case.

PR: 231428
Reviewed by: mmacy, tuexen
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17335


# 5dff1c38 21-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
socket resulted in sending fragmented IPV6 packets.

This is fixes by reducing the MSS to the appropriate value. In addtion,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not stricly required, but done since TCP
is conservative.

PR: 173444
Reviewed by: bz@, rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16796


# c28440db 19-Aug-2018 Randall Stewart <rrs@FreeBSD.org>

This change represents a substantial restructure of the way we
reassembly inbound tcp segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments which in a high BDP situation will cause us to be a
lot more in-efficent as we drop segments beyond 100 entries that
we receive. What this restructure does is cause the reassembly
buffer to coalesce segments putting an emphasis on the two
common cases (which avoid walking the list of segments) i.e.
where we add to the back of the queue of segments and where we
add to the front. We also have the reassembly buffer supporting
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16626


# 8db239dc 30-Jul-2018 Michael Tuexen <tuexen@FreeBSD.org>

Fix some TCP fast open issues.

The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled, calls accept(),
recv(), send(), and close() before the TCP-ACK segment has been received,
the TCP connection is just dropped and the reception of the TCP-ACK
segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled, calls accept(), recv(),
send(), send(), and close() before the TCP-ACK segment has been received,
the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
by close() the TCP connection is just dropped.

Reviewed by: jtl@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16485


# 6138da62 30-Jul-2018 Michael Tuexen <tuexen@FreeBSD.org>

Add missing send/recv dtrace probes for TCP.

These missing probe are mostly in the syncache and timewait code.

Reviewed by: markj@, rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16369


# a026a53a 10-Jul-2018 Michael Tuexen <tuexen@FreeBSD.org>

Use appropriate MSS value when populating the TCP FO client cookie cache

When a client receives a SYN-ACK segment with a TFP fast open cookie,
but without an MSS option, an MSS value from uninitialised stack memory is used.
This patch ensures that in case no MSS option is included in the SYN-ACK,
the appropriate value as given in RFC 7413 is used.

Reviewed by: kbowling@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16175


# 6573d758 03-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


# 9e58ff6f 18-Jun-2018 Matt Macy <mmacy@FreeBSD.org>

convert inpcbinfo hash and info rwlocks to epoch + mutex

- Convert inpcbinfo info & hash locks to epoch for read and mutex for write
- Garbage collect code that handled INP_INFO_TRY_RLOCK failures as
INP_INFO_RLOCK which can no longer fail

When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces
unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to
3%.

Overall packet throughput rate limited by CPU affinity and NIC driver design
choices.

On the receiver unhalted core cycles samples in in_pcblookup_hash went from
13% to to 1.6%

Tested by LLNW and pho@

Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15686


# 10d20c84 07-May-2018 Matt Macy <mmacy@FreeBSD.org>

Fix spurious retransmit recovery on low latency networks

TCP's smoothed RTT (SRTT) can be much larger than an actual observed RTT. This can be either because of hz restricting the calculable RTT to 10ms in VMs or 1ms using the default 1000hz or simply because SRTT recently incorporated a larger value.

If an ACK arrives before the calculated badrxtwin (now + SRTT):
tp->t_badrxtwin = ticks + (tp->t_srtt >> (TCP_RTT_SHIFT + 1));

We'll erroneously reset snd_una to snd_max. If multiple segments were dropped and this happens repeatedly the transmit rate will be limited to 1MSS per RTO until we've retransmitted all drops.

Reported by: rstone
Reviewed by: hiren, transport
Approved by: sbruno
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8556


# 2529f56e 22-Mar-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Add the "TCP Blackbox Recorder" which we discussed at the developer
summits at BSDCan and BSDCam in 2017.

The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.

It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.

You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.

This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.

There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.

Reviewed by: gnn (previous version)
Obtained from: Netflix, Inc.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11085


# 18a75309 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.

The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by: hiren
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14048


# c560df6f 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

This is an implementation of the client side of TCP Fast Open (TFO)
[RFC7413]. It also includes a pre-shared key mode of operation in
which the server requires the client to be in possession of a shared
secret in order to successfully open TFO connections with that server.

The names of some existing fastopen sysctls have changed (e.g.,
net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable).

Reviewed by: tuexen
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14047


# b2fe54bc 10-Feb-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Reinitialize IP header length after checksum calculation. It is used
later by TCP-MD5 code.

This fixes the problem with broken TCP-MD5 over IPv4 when NIC has
disabled TCP checksum offloading.

PR: 223835
MFC after: 1 week


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 3bdf4c42 11-Oct-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Declare more TCP globals in tcp_var.h, so that alternative TCP stacks
can use them. Gather all TCP tunables in tcp_var.h in one place and
alphabetically sort them, to ease maintainance of the list.

Don't copy and paste declarations in tcp_stacks/fastpath.c.


# 4da9052a 23-Aug-2017 Michael Tuexen <tuexen@FreeBSD.org>

Avoid TCP log messages which are false positives.

The check for timestamps are too early to handle SYN-ACK correctly.
So move it down after the corresponing processing has been done.

PR: 216832
Obtained from: antonfb@hesiod.org
MFC after: 1 week


# 43053c12 25-Jul-2017 Sean Bruno <sbruno@FreeBSD.org>

Revert r307901 - Inform CC modules about loss events.

This was discussed between various transport@ members and it was
requested to be reverted and discussed.

Submitted by: Kevin Bowling <kevin.bowling@kev009.com>
Reported by: lawrence
Reviewed by: hiren
Sponsored by: Limelight Networks


# 5d53981a 25-Jul-2017 Sean Bruno <sbruno@FreeBSD.org>

Revert r308180 - Set slow start threshold more accurrately on loss ...

This was discussed between various transport@ members and it was
requested to be reverted and discussed.

Submitted by: kevin
Reported by: lawerence
Reviewed by: hiren


# 98732609 01-Jun-2017 Michael Tuexen <tuexen@FreeBSD.org>

Improve comments to describe what the code does.

Reported by: jtl
Sponsored by: Netflix, Inc.


# ebfd7534 26-Apr-2017 Michael Tuexen <tuexen@FreeBSD.org>

When a SYN-ACK is received in SYN-SENT state, RFC 793 requires the
validation of SEG.ACK as the first step. If the ACK is not acceptable,
a RST segment should be sent and the segment should be dropped.
Up to now, the segment was partially processed.
This patch moves the check for the SEG.ACK validation up to the front
as required.
Reviewed by: hiren, gnn
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D10424


# 013f4df6 12-Apr-2017 Michael Tuexen <tuexen@FreeBSD.org>

The sysctl variable net.inet.tcp.drop_synfin is not honored in all states,
for example not in SYN-SENT.
This patch adds code to check the sysctl variable in other states than
LISTEN.
Thanks to ae and gnn for providing comments.
Reviewed by: gnn
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D9894


# e44c1887 10-Apr-2017 Steven Hartland <smh@FreeBSD.org>

Use estimated RTT for receive buffer auto resizing instead of timestamps

Switched from using timestamps to RTT estimates when performing TCP receive
buffer auto resizing, as not all hosts support / enable TCP timestamps.

Disabled reset of receive buffer auto scaling when not in bulk receive mode,
which gives an extra 20% performance increase.

Also extracted auto resizing to a common method shared between standard and
fastpath modules.

With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from
~3MB/s to ~100MB/s using the default settings.

Reviewed by: lstewart, gnn
MFC after: 2 weeks
Relnotes: Yes
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D9668


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 5ede40dc 11-Feb-2017 Ryan Stone <rstone@FreeBSD.org>

Don't zero out srtt after excess retransmits

If the TCP stack has retransmitted more than 1/4 of the total
number of retransmits before a connection drop, it decides that
its current RTT estimate is hopelessly out of date and decides
to recalculate it from scratch starting with the next ACK.

Unfortunately, it implements this by zeroing out the current RTT
estimate. Drop this hack entirely, as it makes it significantly more
difficult to debug connection issues. Instead check for excessive
retransmits at the point where srtt is updated from an ACK being
received. If we've exceeded 1/4 of the maximum retransmits,
discard the previous srtt estimate and replace it with the latest
rtt measurement.

Differential Revision: https://reviews.freebsd.org/D9519
Reviewed by: gnn
Sponsored by: Dell EMC Isilon


# fcf59617 06-Feb-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Merge projects/ipsec into head/.

Small summary
-------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352


# 2b9c9984 03-Jan-2017 George V. Neville-Neil <gnn@FreeBSD.org>

Fix DTrace TCP tracepoints to not use mtod() as it is both unnecessary and
dangerous. Those wanting data from an mbuf should use DTrace itself to get
the data.

PR: 203409
Reviewed by: hiren
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D9035


# 4ddd5aad 02-Dec-2016 Michael Tuexen <tuexen@FreeBSD.org>

Fix the handling of TCP FIN-segments in the CLOSED state

When a TCP segment with the FIN bit set was received in the CLOSED state,
a TCP RST-ACK-segment is sent. When computing SEG.ACK for this, the
FIN counts as one byte. This accounting was missing and is fixed by this
patch.

Reviewed by: hiren
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://svn.freebsd.org/base/head


# 35dfb8cb 19-Nov-2016 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that TCP state changes to state-closing are reported via dtrace.
This does not cover state changes from TIME-WAIT.

Reviewed by: gnn
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8443


# 6779a1a1 17-Nov-2016 Michael Tuexen <tuexen@FreeBSD.org>

Notify the use via setting errno when a TCP RST segment is received
either in the CLOSING or LAST-ACK state.

Reviewed by: hiren
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8371


# e04310d5 01-Nov-2016 Hiren Panchasara <hiren@FreeBSD.org>

Set slow start threshold more accurately on loss to be flightsize/2 instead of
cwnd/2 as recommended by RFC5681. (spotted by mmacy at nextbsd dot org)

Restore pre-r307901 behavior of aligning ssthresh/cwnd on mss boundary. (spotted
by slawa at zxy dot spb dot ru)

Tested by: dim, Slawa <slawa at zxy dot spb dot ru>
MFC after: 1 month
X-MFC with: r307901
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8349


# 4e7f7553 24-Oct-2016 Hiren Panchasara <hiren@FreeBSD.org>

FreeBSD tcp stack used to inform respective congestion control module about the
loss event but not use or obay the recommendations i.e. values set by it in some
cases.

Here is an attempt to solve that confusion by following relevant RFCs/drafts.
Stack only sets congestion window/slow start threshold values when there is no
CC module availalbe to take that action. All CC modules are inspected and
updated when needed to take appropriate action on loss.

tcp_stacks/fastpath module has been updated to adapt these changes.

Note: Probably, the most significant change would be to not bring congestion
window down to 1MSS on a loss signaled by 3-duplicate acks and letting
respective CC decide that value.

In collaboration with: Matt Macy <mmacy at nextbsd dot org>
Discussed on: transport@ mailing list
Reviewed by: jtl
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8225


# dd13b7d3 24-Oct-2016 Hiren Panchasara <hiren@FreeBSD.org>

Undo r307899. It needs a bit more work and proper commit log.


# 95d82360 24-Oct-2016 Hiren Panchasara <hiren@FreeBSD.org>

In Collaboration with: Matt Macy <mmacy at nextbsd dot com>
Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8225


# f5cf1e5f 18-Oct-2016 Julien Charbon <jch@FreeBSD.org>

Fix a double-free when an inp transitions to INP_TIMEWAIT state
after having been dropped.

This fixes enforces in_pcbdrop() logic in tcp_input():

"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."

PR: 203175
Reported by: Palle Girgensohn, Slawa Olhovchenkov
Tested by: Slawa Olhovchenkov
Reviewed by: Slawa Olhovchenkov
Approved by: gnn, Slawa Olhovchenkov
Differential Revision: https://reviews.freebsd.org/D8211
MFC after: 1 week
Sponsored by: Verisign, inc


# 784ce8fa 17-Oct-2016 Hiren Panchasara <hiren@FreeBSD.org>

Make sure tcp_mss() has the same check as tcp_mss_update() to have t_maxseg set
to at least 64.

This is still just a coverup to avoid kernel panic and not an actual fix.

PR: 213232
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8272


# 09c305eb 14-Oct-2016 Patrick Kelsey <pkelsey@FreeBSD.org>

Fix cases where the TFO pending counter would leak references, and eventually, memory.

Also renamed some tfo labels and added/reworked comments for clarity.

Based on an initial patch from jtl.

PR: 213424
Reviewed by: jtl
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D8235


# 6d172f58 14-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

The code currently resets the keepalive timer each time a packet is
received on a TCP session that has entered the ESTABLISHED state. This
results in a lot of calls to reset the keepalive timer.

This patch changes the behavior so we set the keepalive timer for the
keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will
first check to see if the session has been idle for TP_KEEPIDLE ticks. If
not, it will reschedule the keepalive timer for the time the session will
have been idle for TP_KEEPIDLE ticks.

For a session with regular communication, the keepalive timer should fire
approximately once every TP_KEEPIDLE ticks. For sessions with irregular
communication, the keepalive timer might fire more often. But, the
disruption from a periodic keepalive timer should be less than the regular
cost of resetting the keepalive timer on every packet.

(FWIW, this change saved approximately 1.73% of the busy CPU cycles on a
particular test system with a heavy TCP output load. Of course, the
actual impact is very specific to the particular hardware and workload.)

Reviewed by: gallatin, rrs
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8243


# 68bd7ed1 12-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

The TFO server-side code contains some changes that are not conditioned on
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.

While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.

Reviewed by: hiren
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8219


# 45274760 11-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Currently, when tcp_input() receives a packet on a session that matches a
TCPCB, it checks (so->so_options & SO_ACCEPTCONN) to determine whether or
not the socket is a listening socket. However, this causes the code to
access a different cacheline. If we first check if the socket is in the
LISTEN state, we can avoid accessing so->so_options when processing packets
received for ESTABLISHED sessions.

If INVARIANTS is defined, the code still needs to access both variables to
check that so->so_options is consistent with the state.

Reviewed by: gallatin
MFC after: 1 week
Sponsored by: Netflix


# bd79708d 11-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185


# 3ac12506 06-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by: gnn, lstewart (partial)
Sponsored by: Juniper Networks, Netflix
Differential Revision: (multiple)
Tested by: Limelight, Netflix


# 1d7ee746 29-Sep-2016 Kurt Lidl <lidl@FreeBSD.org>

Properly preserve ip_tos bits for IPv4 packets

Restructure code slightly to save ip_tos bits earlier. Fix the bug
where the ip_tos field is zeroed out before assigning to the iptos
variable. Restore the ip_tos and ip_ver fields only if they have
been zeroed during the pseudo-header checksum calculation.

Reviewed by: cem, gnn, hiren
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8053


# 4b7b743c 25-Aug-2016 Lawrence Stewart <lstewart@FreeBSD.org>

Pass the number of segments coalesced by LRO up the stack by repurposing the
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.

Reviewed by: gallatin, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D7564


# 4c105402 08-Jun-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Cleanup unneded include "opt_ipfw.h".

It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.


# 883054b4 19-May-2016 Don Lewis <truckman@FreeBSD.org>

Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on
control to a three way setting.
0 - Totally disable ECN. (no change)
1 - Enable ECN if incoming connections request it. Outgoing
connections will request ECN. (no change from present != 0 setting)
2 - Enable ECN if incoming connections request it. Outgoing
conections will not request ECN.

Change the default value of net.inet.tcp.ecn.enable from 0 to 2.

Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have
similar capabilities. The actual values above match Linux, and the default
matches the current Linux default.

Reviewed by: eadler
MFC after: 1 month
MFH: yes
Sponsored by: https://reviews.freebsd.org/D6386


# f59d975e 17-May-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Tiny refactor of r294869/r296881: use defines to mask the VNET() macro.

Suggested by: bz


# a4641f4e 03-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/net*: minor spelling fixes.

No functional change.


# b8c2cd15 21-Apr-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Prevent underflows in tp->snd_wnd if the remote side ACKs more than
tp->snd_wnd. This can happen, for example, when the remote side responds to
a window probe by ACKing the one byte it contains.

Differential Revision: https://reviews.freebsd.org/D5625
Reviewed by: hiren
Obtained from: Juniper Networks (earlier version)
MFC after: 2 weeks
Sponsored by: Juniper Networks


# d4d32b9f 16-Mar-2016 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix kernel build after adding new sysctl asserts in r296933.


# bf840a17 14-Mar-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r294869. The array of counters for TCP states doesn't belong to
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.

Place running connection counts into separate array, and provide
separate read-only sysctl oid for it.


# 4644fda3 27-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Rename netinet/tcp_cc.h to netinet/cc/cc.h.

Discussed with: lstewart


# 2de3e790 21-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

- Rename cc.h to more meaningful tcp_cc.h.
- Declare it a kernel only include, which it already is.
- Don't include tcp.h implicitly from tcp_cc.h


# b66d74c1 21-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Cleanup TCP files from unnecessary interface related includes.


# 0c39d38d 06-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Historically we have two fields in tcpcb to describe sender MSS: t_maxopd,
and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned
up after T/TCP removal. After all permutations over the years the result is
that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum
protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if
timestamps are in action, or is equal to t_maxopd otherwise. That's a very
rough estimate of MSS reduced by options length. Throughout the code it
was used in places, where preciseness was not important, like cwnd or
ssthresh calculations.

With this change:

- t_maxopd goes away.
- t_maxseg now stores MSS not adjusted by options.
- new function tcp_maxseg() is provided, that calculates MSS reduced by
options length. The functions gives a better estimate, since it takes
into account SACK state as well.

Reviewed by: jtl
Differential Revision: https://reviews.freebsd.org/D3593


# 2d8868db 29-Dec-2015 Jonathan T. Looney <jtl@FreeBSD.org>

When checking the inp_ip_minttl restriction for IPv6 packets, don't check
the IPv4 header.

CID: 1017920
Differential Revision: https://reviews.freebsd.org/D4727
Reviewed by: bz
MFC after: 2 weeks
Sponsored by: Juniper Networks


# 281a0fd4 24-Dec-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Implementation of server-side TCP Fast Open (TFO) [RFC7413].

TFO is disabled by default in the kernel build. See the top comment
in sys/netinet/tcp_fastopen.c for implementation particulars.

Reviewed by: gnn, jch, stas
MFC after: 3 days
Sponsored by: Verisign, Inc.
Differential Revision: https://reviews.freebsd.org/D4350


# 55bceb1e 15-Dec-2015 Randall Stewart <rrs@FreeBSD.org>

First cut of the modularization of our TCP stack. Still
to do is to clean up the timer handling using the async-drain.
Other optimizations may be coming to go with this. Whats here
will allow differnet tcp implementations (one included).
Reviewed by: jtl, hiren, transports
Sponsored by: Netflix Inc.
Differential Revision: D4055


# 021eaf79 08-Dec-2015 Hiren Panchasara <hiren@FreeBSD.org>

One of the ways to detect loss is to count duplicate acks coming back from the
other end till it reaches predetermined threshold which is 3 for us right now.
Once that happens, we trigger fast-retransmit to do loss recovery.

Main problem with the current implementation is that we don't honor SACK
information well to detect whether an incoming ack is a dupack or not. RFC6675
has latest recommendations for that. According to it, dupack is a segment that
arrives carrying a SACK block that identifies previously unknown information
between snd_una and snd_max even if it carries new data, changes the advertised
window, or moves the cumulative acknowledgment point.

With the prevalence of Selective ACK (SACK) these days, improper handling can
lead to delayed loss recovery.

With the fix, new behavior looks like following:

0) th_ack < snd_una --> ignore
Old acks are ignored.
1) th_ack == snd_una, !sack_changed --> ignore
Acks with SACK enabled but without any new SACK info in them are ignored.
2) th_ack == snd_una, window == old_window --> increment
Increment on a good dupack.
3) th_ack == snd_una, window != old_window, sack_changed --> increment
When SACK enabled, it's okay to have advertized window changed if the ack has
new SACK info.
4) th_ack > snd_una --> reset to 0
Reset to 0 when left edge moves.
5) th_ack > snd_una, sack_changed --> increment
Increment if left edge moves but there is new SACK info.

Here, sack_changed is the indicator that incoming ack has previously unknown
SACK info in it.

Note: This fix is not fully compliant to RFC6675. That may require a few
changes to current implementation in order to keep per-sackhole dupack counter
and change to the way we mark/handle sack holes.

PR: 203663
Reviewed by: jtl
MFC after: 3 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D4225


# 054d38e3 04-Nov-2015 Hiren Panchasara <hiren@FreeBSD.org>

Improve the sysctl node name.

X-MFC with: r290122
Sponsored by: Limelight Networks


# 12eeb81f 28-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

Calculate the correct amount of bytes that are in-flight for a connection as
suggested by RFC 6675.

Currently differnt places in the stack tries to guess this in suboptimal ways.
The main problem is that current calculations don't take sacked bytes into
account. Sacked bytes are the bytes receiver acked via SACK option. This is
suboptimal because it assumes that network has more outstanding (unacked) bytes
than the actual value and thus sends less data by setting congestion window
lower than what's possible which in turn may cause slower recovery from losses.

As an example, one of the current calculations looks something like this:
snd_nxt - snd_fack + sackhint.sack_bytes_rexmit
New proposal from RFC 6675 is:
snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit
which takes sacked bytes into account which is a new addition to the sackhint
struct. Only thing we are missing from RFC 6675 is isLost() i.e. segment being
considered lost and thus adjusting pipe based on that which makes this
calculation a bit on conservative side.

The approach is very simple. We already process each ack with sack info in
tcp_sack_doack() and extract sack blocks/holes out of it. We'd now also track
this new variable sacked_bytes which keeps track of total sacked bytes reported.

One downside to this approach is that we may get incorrect count of sacked_bytes
if the other end decides to drop sack info in the ack because of memory pressure
or some other reasons. But in this (not very likely) case also the pipe
calculation would be conservative which is okay as opposed to being aggressive
in sending packets into the network.

Next step is to use this more accurate pipe estimation to drive congestion
window adjustments.

In collaboration with: rrs
Reviewed by: jason_eggnet dot com, rrs
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D3971


# 356c7958 27-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

Add sysctl tunable net.inet.tcp.initcwnd_segments to specify initial congestion
window in number of segments on fly. It is set to 10 segments by default.

Remove net.inet.tcp.experimental.initcwnd10 which is now redundant. Also remove
the parent node net.inet.tcp.experimental as it's not needed anymore and also
because it was not well thought out.

Differential Revision: https://reviews.freebsd.org/D3858
In collaboration with: lstewart
Reviewed by: gnn (prev version), rwatson, allanjude, wblock (man page)
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Limelight Networks


# 86a996e6 13-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren


# 62d4443f 06-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

Add a comment specifying how we implement rfc3042.

Differential Revision: D3746
MFC after: 1 week
Sponsored by: Limelight Networks


# 1558cb24 26-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Eliminate nd6_nud_hint() and its TCP bindings.

Initially function was introduced in r53541 (KAME initial commit) to
"provide hints from upper layer protocols that indicate a connection
is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability
Confirmation).
However, it was converted to do nothing (e.g. just return) in r122922
(tcp_hostcache implementation) back in 2003. Some defines were moved
to tcp_var.h in r169541. Then, it was broken (for non-corner cases)
by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So,
right now this code is broken and has no "real" base users.

Differential Revision: https://reviews.freebsd.org/D3699


# 5d06879a 13-Sep-2015 George V. Neville-Neil <gnn@FreeBSD.org>

dd DTrace probe points, translators and a corresponding script
to provide the TCPDEBUG functionality with pure DTrace.

Reviewed by: rwatson
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: D3530


# ff9b006d 02-Aug-2015 Julien Charbon <jch@FreeBSD.org>

Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

- The existing TCP INP_INFO lock continues to protect the global inpcb list
stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
and free) and inpcb global counters.

It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR: 183659
Differential Revision: https://reviews.freebsd.org/D2599
Reviewed by: jhb, adrian
Tested by: adrian, nitroboost-gmail.com
Sponsored by: Verisign, Inc.


# 4741bfcb 29-Jul-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Revert r265338, r271089 and r271123 as those changes do not handle
non-inline urgent data and introduce an mbuf exhaustion attack vector
similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs.

Address the issue described in FreeBSD-SA-15:15.tcp.

Reviewed by: glebius
Approved by: so
Approved by: jmallett (mentor)
Security: FreeBSD-SA-15:15.tcp
Sponsored by: Norse Corp, Inc.


# d57724fd 17-Jul-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Check TCP timestamp option flag so that the automatic receive buffer
scaling code does not use an uninitialized timestamp echo reply value
from the stack when timestamps are not enabled.

Differential Revision: https://reviews.freebsd.org/D3060
Reviewed by: hiren
Approved by: jmallett (mentor)
MFC after: 3 days
Sponsored by: Norse Corp, Inc.


# 4c3972f0a 22-Jun-2015 Hiren Panchasara <hiren@FreeBSD.org>

Reverting r284710.
Today I learned: iff == if and only if.

Suggested by: many


# 26f2eb69 22-Jun-2015 Hiren Panchasara <hiren@FreeBSD.org>

Fix a typo: s/iff/if/

Sponsored by: Limelight Networks


# c52102dd 19-May-2015 Hiren Panchasara <hiren@FreeBSD.org>

Correct the wording as we are increasing the window size.

Reviewed by: jhb
Sponsored by: Limelight Networks


# 64807b30 12-Jan-2015 Hiren Panchasara <hiren@FreeBSD.org>

DCTCP (Data Center TCP) implementation.

DCTCP congestion control algorithm aims to maximise throughput and minimise
latency in data center networks by utilising the proportion of Explicit
Congestion Notification (ECN) marked packets received from capable hardware as a
congestion signal.

Highlights:
Implemented as a mod_cc(4) module.
ECN (Explicit congestion notification) processing is done differently from
RFC3168.
Takes one-sided DCTCP into consideration where only one of the sides is using
DCTCP and other is using standard ECN.

IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Thesis report by Midori Kato: https://eggert.org/students/kato-thesis.pdf

Submitted by: Midori Kato <katoon@sfc.wide.ad.jp> and
Lars Eggert <lars@netapp.com>
with help and modifications from
hiren
Differential Revision: https://reviews.freebsd.org/D604
Reviewed by: gnn


# 44eb8bbe 11-Dec-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Do not count security policy violation twice.
ipsec*_in_reject() do this by their own.

Obtained from: Yandex LLC
Sponsored by: Yandex LLC


# c2529042 01-Dec-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after: 1 month
Sponsored by: Mellanox Technologies


# 651e4e6a 30-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# cfa6009e 12-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 3e88eb90 08-Nov-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Remove ip6_getdstifaddr() and all functions to work with auxiliary data.
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.

Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.

Sponsored by: Yandex LLC


# 6df8a710 07-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.

Sponsored by: Nginx, Inc.


# 9fd573c3 22-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Improve transmit sending offload, TSO, algorithm in general.

The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week


# 3220a212 16-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

FreeBSD-SA-14:19.tcp raised attention to the state of our stack
towards blind SYN/RST spoofed attack.

Originally our stack used in-window checks for incoming SYN/RST
as proposed by RFC793. Later, circa 2003 the RST attack was
mitigated using the technique described in P. Watson
"Slipping in the window" paper [1].

After that, the checks were only relaxed for the sake of
compatibility with some buggy TCP stacks. First, r192912
introduced the vulnerability, just fixed by aforementioned SA.
Second, r167310 had slightly relaxed the default RST checks,
instead of utilizing net.inet.tcp.insecure_rst sysctl.

In 2010 a new technique for mitigation of these attacks was
proposed in RFC5961 [2]. The idea is to send a "challenge ACK"
packet to the peer, to verify that packet arrived isn't spoofed.
If peer receives challenge ACK it should regenerate its RST or
SYN with correct sequence number. This should not only protect
against attacks, but also improve communication with broken
stacks, so authors of reverted r167310 and r192912 won't be
disappointed.

[1] http://bandwidthco.com/whitepapers/netforensics/tcpip/TCP Reset Attacks.pdf
[2] http://www.rfc-editor.org/rfc/rfc5961.txt

Changes made:

o Revert r167310.
o Implement "challenge ACK" protection as specificed in RFC5961
against RST attack. On by default.
- Carefully preserve r138098, which handles empty window edge
case, not described by the RFC.
- Update net.inet.tcp.insecure_rst description.
o Implement "challenge ACK" protection as specificed in RFC5961
against SYN attack. On by default.
- Provide net.inet.tcp.insecure_syn sysctl, to turn off
RFC5961 protection.

The changes were tested at Netflix. The tested box didn't show
any anomalies compared to control box, except slightly increased
number of TCP connection in LAST_ACK state.

Reviewed by: rrs
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 831ad37e 16-Sep-2014 Xin LI <delphij@FreeBSD.org>

Fix Denial of Service in TCP packet processing.

Submitted by: glebius
Security: FreeBSD-SA-14:19.tcp


# a7c7f2a7 04-Sep-2014 John Baldwin <jhb@FreeBSD.org>

In tcp_input(), don't acquire the pcbinfo global write lock for SYN
packets targeting a listening socket. Permit to reduce TCP input
processing starvation in context of high SYN load (e.g. short-lived TCP
connections or SYN flood).

Submitted by: Julien Charbon <jcharbon@verisign.com>
Reviewed by: adrian, hiren, jhb, Mike Bentkofsky


# f7469d3e 09-Aug-2014 Hiren Panchasara <hiren@FreeBSD.org>

Improve comments by listing a criteria for automatic increment of receive socket
buffer.

Reviewed by: jmg


# 8f5a8818 07-Aug-2014 Kevin Lo <kevlo@FreeBSD.org>

Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric: D476
Reviewed by: jhb


# cc412412 02-Jul-2014 Hiren Panchasara <hiren@FreeBSD.org>


# 3150f357 24-May-2014 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove the prototpye for the static inline function
tcp_signature_verify_input().
The function is defined before first use already.

MFC after: 2 weeks


# 5688fa66 23-May-2014 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove the prototypes for things that are no longer file local but were
moved to the header file.

Pointy hat to: clang || bz
MFC after: 2 weeks
X-MFC with: r266596
Reported by: gcc build of sparc64


# 255cd9fd 23-May-2014 Bjoern A. Zeeb <bz@FreeBSD.org>

Move the tcp_fields_to_host() and tcp_fields_to_net() (inline)
functions to the tcp_var.h header file in order to avoid further
duplication with upcoming commits.

Reviewed by: np
MFC after: 2 weeks


# 2f719932 18-May-2014 Adrian Chadd <adrian@FreeBSD.org>

Ensure that the flowid hashtype is assigned to the inp if the flowid
is also assigned.


# e407b67b 04-May-2014 Gleb Smirnoff <glebius@FreeBSD.org>

The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with
mixing on stack memory and UMA memory in one linked list.

Thus, rewrite TCP reassembly code in terms of memory usage. The
algorithm remains unchanged.

We actually do not need extra memory to build a reassembly queue.
Arriving mbufs are always packet header mbufs. So we got the length
of data as pkthdr.len. We got m_nextpkt for linkage. And we need
only one pointer to point at the tcphdr, use PH_loc for that.

In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen
field now counts not packets, but bytes in the queue. This gives
us more precision when comparing to socket buffer limits.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 85536381 02-Apr-2014 Hiren Panchasara <hiren@FreeBSD.org>

Improve readability of comments for DELAY_ACK() macro.


# 153edc50 25-Mar-2014 Hiren Panchasara <hiren@FreeBSD.org>

Correct the comments as support for RFC 1644 has been removed for a long time.


# 62b90589 28-Jan-2014 Peter Wemm <peter@FreeBSD.org>

Adjust r239672 from rrs and r258821 from eadler.

By definition, the very first FIN is not a duplicate. Process it normally
and don't feed it to congestion control as though it were a dupe. Don't
prevent CC from seeing later dupe acks while in a half close state.


# b8b4cfcd 25-Dec-2013 Sergey Kandaurov <pluknet@FreeBSD.org>

Draft-ietf-tcpm-initcwnd-05 became RFC6928.

MFC after: 1 week


# 5f30ec9b 01-Dec-2013 Eitan Adler <eadler@FreeBSD.org>

In a situation where:
- The remote host sends a FIN
- in an ACK for a sequence number for which an ACK has already
been received
- There is still unacked data on route to the remote host
- The packet does not contain a window update

The packet may be dropped without processing the FIN flag.

PR: kern/99188
Submitted by: Staffan Ulfberg <staffan@ulfberg.se>
Discussed with: andre
MFC after: never


# d9fae5ab 26-Nov-2013 Andriy Gapon <avg@FreeBSD.org>

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks


# fa22ce15 25-Nov-2013 Adrian Chadd <adrian@FreeBSD.org>

Convert over the TCP probes to use mtod() rather than directly
dereferencing m->m_data.

Sponsored by: Netflix, Inc.


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 76039bc8 26-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# c1e5a6e5 22-Oct-2013 Andre Oppermann <andre@FreeBSD.org>

The TCP delayed ACK logic isn't aware of LRO passing up large aggregated
segments thinking it received only one segment. This causes it to enable
the delay the ACK for 100ms to wait for another segment which may never
come because all the data was received already.

Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes
us further away from acking every other packet; b) it introduces additional
delay in responding to the sender. The latter is especially bad because it
is in the nature of LRO to aggregated all segments of a burst with no more
coming until an ACK is sent back.

Change the delayed ACK logic to detect LRO segments by being larger than
the MSS for this connection and issuing an immediate ACK for them to keep
the ACK clock ticking without interruption.

Reported by: julian, cperciva
Tested by: cperciva
Reviewed by: lstewart
MFC after: 3 days


# c11a15bf 08-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

When processing ACK in tcp_do_segment, use sbcut_locked() instead of
sbdrop_locked() to cut acked mbufs from the socket buffer. Free this
chain a batch manner after the socket buffer lock is dropped.

This measurably reduces contention on socket buffer.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Approved by: re (marius)


# 57f60867 25-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


# 6794f460 23-Jul-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Remove the large part of struct ipsecstat. Only few fields of this
structure is used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.

This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.

Reported by: Taku YAMAMOTO, Maciej Milewski


# 07dacf03 09-Jul-2013 Andre Oppermann <andre@FreeBSD.org>

Extend debug logging of TCP timestamp related specification
violations.

Update related comments and style.


# 5da0521f 09-Jul-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Use new macros to implement ipstat and tcpstat using PCPU counters.
Change interface of kread_counters() similar ot kread() in the netstat(1).


# 42a253e6 21-Jun-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Fix kmod_*stat_inc() after r249276. The incorrect code actually
increased the pointer, not the memory it points to.

In collaboration with: kib
Reported & tested by: Ian FREISLICH <ianf clue.co.za>
Sponsored by: Nginx, Inc.


# 6659296c 20-Jun-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics
accounting.

MFC after: 2 weeks


# 3c914c54 02-Jun-2013 Andre Oppermann <andre@FreeBSD.org>

Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore. The upper limit is still at
IP_MAXPACKET (65536 bytes). Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found. When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by: cperciva (earlier version)
Reviewed by: cperciva
Tested by: cperciva
MFC after: 1 week (using spare struct members to preserve ABI)


# 5628dd08 23-Apr-2013 Andre Oppermann <andre@FreeBSD.org>

When doing RFC3042 limited transmit on the first on second
duplicate ACK make sure we actually have new data to send.
This prevents us from sending unneccessary pure ACKs.

Reported by: Matt Miller <matt@matthewjmiller.net>
Tested by: Matt Miller <matt@matthewjmiller.net>
MFC after: 2 weeks


# 982c1675 09-Apr-2013 Andre Oppermann <andre@FreeBSD.org>

Fix a race condition on tcp listen socket teardown with pending
connections in the accept queue and contiguous new incoming SYNs.

Compared to the original submitters patch I've moved the test
next to the SYN handling to have it together in a logical unit
and reworded the comment explaining the issue.

Submitted by: Matt Miller <matt@matthewjmiller.net>
Submitted by: Juan Mojica <jmojica@gmail.com>
Reviewed by: Matt Miller (changes)
Tested by: pho
MFC after: 1 week


# 4a21e86e 09-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Fix VIMAGE build.


# 5923c293 08-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/counters: TCP/IP stats.

Convert 'struct ipstat' and 'struct tcpstat' to counter(9).

This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.

Sponsored by: Nginx, Inc.


# ce7ad664 29-Mar-2013 Ed Maste <emaste@FreeBSD.org>

Keep fwd_tag around for subsequent pcb lookups

For TIMEWAIT handling tcp_input may have to jump back for an additional
pass through pcblookup. Prior to this change the fwd_tag had been
discarded after the first lookup, so a new connection attempt delivered
locally via 'ipfw fwd' would fail to find a match.

As of r248886 the tag will be detached and freed when passed to the
socket buffer.


# 5b648e79 22-Jan-2013 Lawrence Stewart <lstewart@FreeBSD.org>

Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited"
logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are
unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could
therefore potentially corrupt the result (although under normal operation,
neither variable should legitmately exceed 32 bits).

[1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html

Submitted by: jhb
MFC after: 1 week


# b8056fae 18-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix !INET6 build after r244365.


# dd029d52 18-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Clear correct flag in INET6 case.


# f4912745 17-Dec-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Since we use different flags to detect tcp forwarding, and we share the
same code for IPv4 and IPv6 in tcp_input, we should check both
M_IP_NEXTHOP and M_IP6_NEXTHOP flags.

MFC after: 3 days


# 78a7880f 12-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix a crash in tcp_input(), that happens when mbuf has a fwd_tag on it,
but later after processing and freeing the tag, we need to jump back again
to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to
process and free the tag for second time.

Reported & tested by: Pawel Tyll <ptyll nitronet.pl>
MFC after: 3 days


# 60ee3bb2 05-Nov-2012 Andre Oppermann <andre@FreeBSD.org>

Back out r242262. The simplified window change/update logic wasn't
complete and ready for production use.

PR: kern/173309


# ffdbf9da 01-Nov-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by: andre


# 09440655 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.

As long as it remains a draft it is placed under a sysctl marking it
as experimental:
net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.

This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet. Also the restart
window isn't yet increased as allowed. Both will be adjusted with
upcoming changes.

Is is enabled by default. In Linux it is enabled since kernel 3.0.

MFC after: 2 weeks


# 79ce26a0 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Simplify and enhance the window change/update acceptance logic,
especially in the presence of bi-directional data transfers.

snd_wl1 tracks the right edge, including data in the reassembly
queue, of valid incoming data. This makes it like rcv_nxt plus
reassembly. It never goes backwards to prevent older, possibly
reordered segments from updating the window.

snd_wl2 tracks the left edge of sent data. This makes it a duplicate
of snd_una. However joining them right now is difficult due to
separate update dependencies in different places in the code flow.

snd_wnd tracks the current advertized send window by the peer. In
tcp_output() the effective window is calculated by subtracting the
already in-flight data, snd_nxt less snd_una, from it.

ACK's become the main clock of window updates and will always update
the window when the left edge of what we sent is advanced. The ACK
clock is the primary signaling mechanism in ongoing data transfers.
This works reliably even in the presence of reordering, reassembly
and retransmitted segments. The ACK clock is most important because
it determines how much data we are allowed to inject into the network.

Zero window updates get us out of persistence mode are crucial. Here
a segment that neither moves ACK nor SEQ but enlarges WND is accepted.

When the ACK clock is not active (that is we're not or no longer
sending any data) any segment that moves the extended right SEQ edge,
including out-of-order segments, updates the window. This gives us
updates especially during ping-pong transfers where the peer isn't
done consuming the already acknowledged data from the receive buffer
while responding with data.

The SSH protocol is a prime candidate to benefit from the improved
bi-directional window update logic as it has its own windowing
mechanism on top of TCP and is frequently sending back protocol ACK's.

Tcpdump provided by: darrenr
Tested by: darrenr
MFC after: 2 weeks


# 4faaea55 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Allow arbitrary MSS sizes and don't mind about the cluster size anymore.
We've got more cluster sizes for quite some time now and the orginally
imposed limits and the previously codified thoughts on efficiency gains
are no longer true.

MFC after: 2 weeks


# cf8f04f4 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment. This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after: 2 weeks


# 22efabd4 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Adjust the initial default CWND upon connection establishment to the
new and increased values specified by RFC5681 Section 3.1.

The even larger initial CWND per RFC3390, if enabled, is not affected.

MFC after: 2 weeks


# c1de64a4 25-Oct-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks


# 8ad458a4 23-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Do not reduce ip_len by size of IP header in the ip_input()
before passing a packet to protocol input routines.
For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.

Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.


# 8f134647 22-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>


# 105bd211 12-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

In ip_stripoptions():
- Remove unused argument and incorrect comment.
- Fixup ip_len after stripping.


# ec03d543 25-Aug-2012 Randall Stewart <rrs@FreeBSD.org>

This small change takes care of a race condition
that can occur when both sides close at the same time.
If that occurs, without this fix the connection enters
FIN1 on both sides and they will forever send FIN|ACK at
each other until the connection times out. This is because
we stopped processing the FIN|ACK and thus did not advance
the sequence and so never ACK'd each others FIN. This
fix adjusts it so we *do* process the FIN properly and
the race goes away ;-)

MFC after: 1 month


# 0989f56c 22-Jul-2012 Robert Watson <rwatson@FreeBSD.org>

Update some stale comments regarding tcbinfo locking in the TCP input
path: read locks on tcbinfo are no longer used, so won't happen. No
functional change.

MFC after: 3 days


# 09fe6320 19-Jun-2012 Navdeep Parhar <np@FreeBSD.org>

- Updated TOE support in the kernel.

- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)


# 77d396fd 04-Jun-2012 Maksim Yevmenkin <emax@FreeBSD.org>

Plug more refcount leaks and possible NULL deref for interface
address list.

Submitted by: scottl@
MFC after: 3 days


# 356ab07e 28-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958


# 45747ba5 24-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 bz_ipv6_fast:

Add code to handle pre-checked TCP checksums as indicated by mbuf
flags to save the entire computation for validation if not needed.

In the IPv6 TCP output path only compute the pseudo-header checksum,
set the checksum offset in the mbuf field along the appropriate flag
as done in IPv4.

In tcp_respond() just initialize the IPv6 payload length to 0 as
ip6_output() will properly set it.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# 3a9391def 24-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 bz_ipv6_fast:

Factor out the tcp_hc_getmtu() call. As the comments say it
applies to both v4 and v6, so only write it once making it easier
to read the protocol family specifc code.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# ef341ee1 16-Apr-2012 Gleb Smirnoff <glebius@FreeBSD.org>

When we receive an ICMP unreach need fragmentation datagram, we take
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.

This sequence gets broken if the TCP host cache is exausted. In this
case allocation fails, and later called tcp_mss_update() finds nothing
in cache. The fast retransmit is done with not reduced MSS and is
immidiately replied by remote host with new ICMP datagrams and the
cycle repeats. This ping-pong can go up to wirespeed.

To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
MTU value, that is passed to tcp_mss_update() as mtuoffer.

Reported by: az
Reported by: Andrey Zonov <andrey zonov.org>
Reviewed by: andre (previous version of patch)


# d8951c8a 15-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when
hz >> 1000 and thus getting outside the timestamp clock frequenceny of
1ms < x < 1s per tick as mandated by RFC1323, leading to connection
resets on idle connections.

Always use a granularity of 1ms using getmicrouptime() making all but
relevant callouts independent of hz.

Use getmicrouptime(), not getmicrotime() as the latter may make a jump
possibly breaking TCP nfsroot mounts having our timestamps move forward
for more than 24.8 days in a second without having been idle for that
long.

PR: kern/61404
Reviewed by: jhb, mav, rrs
Discussed with: silby, lstewart
Sponsored by: Sandvine Incorporated (originally in 2011)
MFC after: 6 weeks


# 9077f387 05-Feb-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT, that allow to control initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.

Reviewed by: andre, bz, lstewart


# 1e96ae81 05-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Remove the assertion from tcp_input() that rcv_nxt is always greater
than or equal to rcv_adv and fix tcp_twstart() to handle this case by
assuming the last window was zero rather than a negative value.

The code in tcp_input() already safely handled this case. It can happen
due to delayed ACKs along with a remote sender that sends data beyond
the window we previously advertised. If we have room in our socket buffer
for the extra data beyond the advertised window, we will accept it.
However, if the ACK for that segment is delayed, then we will not
effectively fixup rcv_adv to account for that extra data until the
next segment arrives and forces out an ACK. When that next segment
arrives, rcv_nxt will be beyond rcv_adv.

Tested by: pjd
MFC after: 1 week


# 6472ac3d 07-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# ddd0c4a9 02-Nov-2011 Sergey Kandaurov <pluknet@FreeBSD.org>

Restore sysctl names for tcp_sendspace/tcp_recvspace.

They seem to be changed unintentionally in r226437, and there were no
any mentions of renaming in commit log message.

Reported by: Anton Yuzhaninov <citrin citrin ru>


# fba0cea1 16-Oct-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Add syntactic sugar missed in r226437 and then not added either when moving
things around in r226448 but desperately needed to always make things
compile successfully.

MFC after: 1 week


# 873789cb 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

Move the tcp_sendspace and tcp_recvspace sysctl's from
the middle of tcp_usrreq.c to the top of tcp_output.c
and tcp_input.c respectively next to the socket buffer
autosizing controls.

MFC after: 1 week


# 9ec4a4cc 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

Remove the ss_fltsz and ss_fltsz_local sysctl's which have
long been superseded by the RFC3390 initial CWND sizing.

Also remove the remnants of TCP_METRICS_CWND which used the
TCP hostcache to set the initial CWND in a non-RFC compliant
way.

MFC after: 1 week


# e233e2ac 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

VNET virtualize tcp_sendspace/tcp_recvspace and change the
type to INT. A long is not necessary as the TCP window is
limited to 2**30. A larger initial window isn't useful.

MFC after: 1 week


# 4af309c8 06-Oct-2011 Attilio Rao <attilio@FreeBSD.org>

For the INP_TIMEWAIT case, there is no valid tcpcb object tied to the
inpcb object.
Skip the TCP_SIGNATURE check in that case as it is consistent with the
output path (no TCP_SIGNATURE for outcoming packets in TIMEWAIT state)
and also because for TIMEWAIT state the verify may be less effective.

Sponsored by: Sandvine Incorporated
Reported by: rwatson
No objections by: rwatson
MFC after: 3 days


# b233773b 25-Aug-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Increase the defaults for the maximum socket buffer limit,
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.

For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.

Note that this is just the defaults. They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption. All values are still tunable
by sysctls.

Suggested by: gnn
Discussed on: arch (Mar and Aug 2011)
MFC after: 3 weeks
Approved by: re (kib)


# 6f697424 20-Aug-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix compilation in case of defined(INET) && defined(IPFIREWALL_FORWARD)
but no INET6.

Reported by: avg
Tested by: avg
MFC after: 4 weeks
X-MFC with: r225044
Approved by: re (kib)


# 8a006adb 20-Aug-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Add support for IPv6 to ipfw fwd:
Distinguish IPv4 and IPv6 addresses and optional port numbers in
user space to set the option for the correct protocol family.
Add support in the kernel for carrying the new IPv6 destination
address and port.
Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change
the address in the IP header.
Add support for IPv6 forwarding to a non-local destination.
Add a regession test uitilizing VIMAGE to check all 20 possible
combinations I could think of.

Obtained from: David Dolson at Sandvine Incorporated
(original version for ipfw fwd IPv6 support)
Sponsored by: Sandvine Incorporated
PR: bin/117214
MFC after: 4 weeks
Approved by: re (kib)


# d3c1f003 04-Jun-2011 Robert Watson <rwatson@FreeBSD.org>

Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).

Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.

(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


# fa046d87 30-May-2011 Robert Watson <rwatson@FreeBSD.org>

Decompose the current single inpcbinfo lock into two locks:

- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


# 5891ebd6 14-May-2011 John Baldwin <jhb@FreeBSD.org>

Oops, fix order of sequence numbers in KASSERT()'s to catch negative
receive windows to match the labels in the panic message.

Submitted by: trociny


# f701e30d 02-May-2011 John Baldwin <jhb@FreeBSD.org>

Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowing draining data from the receive buffer), then the single byte
of data in the window probe is accepted. However, this can cause rcv_nxt
to be greater than rcv_adv. This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked.

During the window while rcv_nxt is greather than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1. On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long. In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.

Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greather than rcv_adv.

Reviewed by: bz
MFC after: 1 month


# 29bd2010 30-Apr-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix a mismerge from p4 in that in_localaddr() is not available without INET.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


# b287c6c7 30-Apr-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness. To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


# 672dc4ae 29-Apr-2011 John Baldwin <jhb@FreeBSD.org>

TCP reuses t_rxtshift to determine the backoff timer used for both the
persist state and the retransmit timer. However, the code that implements
"bad retransmit recovery" only checks t_rxtshift to see if an ACK has been
received in during the first retransmit timeout window. As a result, if
ticks has wrapped over to a negative value and a socket is in the persist
state, it can incorrectly treat an ACK from the remote peer as a
"bad retransmit recovery" and restore saved values such as snd_ssthresh and
snd_cwnd. However, if the socket has never had a retransmit timeout, then
these saved values will be zero, so snd_ssthresh and snd_cwnd will be set
to 0.

If the socket is in fast recovery (this can be caused by excessive
duplicate ACKs such as those fixed by 220794), then each ACK that arrives
triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd
to be no larger than snd_ssthresh. In effect, the socket's send window
is permamently stuck at 0 even though the remote peer is advertising a
much larger window and pending data is only sent via TCP window probes
(so one byte every few seconds).

Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that
the various snd_*_prev fields in the pcb are valid and only perform
"bad retransmit recovery" if this flag is set in the pcb. The flag is set
on the first retransmit timeout that occurs and is cleared on subsequent
retransmit timeouts or when entering the persist state.

Reviewed by: bz
MFC after: 2 weeks


# 2903309a 25-Apr-2011 Attilio Rao <attilio@FreeBSD.org>

Add the possibility to verify MD5 hash of incoming TCP packets.
As long as this is a costy function, even when compiled in (along with
the option TCP_SIGNATURE), it can be disabled via the
net.inet.tcp.signature_verify_input sysctl.

Sponsored by: Sandvine Incorporated
Reviewed by: emaste, bz
MFC after: 2 weeks


# 891b8ed4 12-Apr-2011 Lawrence Stewart <lstewart@FreeBSD.org>

Use the full and proper company name for Swinburne University of Technology
throughout the source tree.

Requested by: Grenville Armitage, Director of CAIA at Swinburne University of
Technology
MFC after: 3 days


# 766282cb 29-Mar-2011 John Baldwin <jhb@FreeBSD.org>

Clamp the initial advertised receive window when responding to a SYN/ACK
to the maximum allowed window. Growing the window too large would cause
an underflow in the calculations in tcp_output() to decide if a window
update should be sent which would prevent the persist timer from being
started if data was pending and the other end of the connection advertised
an initial window size of 0.

PR: kern/154006
Submitted by: Stefan `Sec` Zehl sec 42 org
Reviewed by: bz
MFC after: 1 week


# d64a46ea 09-Jan-2011 Lawrence Stewart <lstewart@FreeBSD.org>

Reset the last_sack_ack SACK hint for TCP input processing to ensure that the
hint is 0 when no SACK data is received to update the hint with. This was
accidentally omitted from r216753.

Sponsored by: FreeBSD Foundation
MFC after: 10 weeks
X-MFC with: 216753


# 79e955ed 07-Jan-2011 John Baldwin <jhb@FreeBSD.org>

Trim extra spaces before tabs.


# 39bc9de5 27-Dec-2010 Lawrence Stewart <lstewart@FreeBSD.org>

- Add some helper hook points to the TCP stack. The hooks allow Khelp modules to
access inbound/outbound events and associated data for established TCP
connections. The hooks only run if at least one hook function is registered
for the hook point, ensuring the impact on the stack is effectively nil when
no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
data to any registered Khelp module hook functions.

- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
modules to associate per-connection data with the TCP control block.

- Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes
introduced by this commit and r216753.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz, others along the way
MFC after: 3 months


# 6157935f 01-Dec-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Set ssthresh appropriately on RTO. This change was accidentally not ported from
the pre modular CC stack.

Sponsored by: FreeBSD Foundation
Submitted by: David Hayes <dahayes at swin edu au>
MFC after: 9 weeks
X-MFC with: r215166


# dbc42409 11-Nov-2010 Lawrence Stewart <lstewart@FreeBSD.org>

This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
algorithms to be used in the net stack. Algorithms can maintain per-connection
state if required, and connections maintain their own algorithm pointer, which
allows different connections to concurrently use different algorithms. The
TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
programmatically query or change the congestion control algorithm respectively
from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
possible. Care was also taken to develop the framework in a way that should
allow integration with other congestion aware transport protocols (e.g. SCTP)
in the future. The hope is that we will one day be able to share a single set
of congestion control algorithm modules between all congestion aware transport
protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
and use it to decouple the meaning of recovery from a congestion event and
recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
congestion control protocols don't generally need to recover from packet loss
and need a different way to note a congestion recovery episode within the
stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
and ensures the stack always uses the appropriate mechanisms for recovering
from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
massage it into module form. NewReno is always built into the kernel and will
remain the default algorithm for the forseeable future. Implementations of
additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: Cisco URP, FreeBSD Foundation
Reviewed by: rpaulo
Tested by: David Hayes (and many others over the years)
MFC after: 3 months


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 1c18314d 16-Sep-2010 Andre Oppermann <andre@FreeBSD.org>

Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework. It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI. It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI. It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned. The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.


# 8502ec25 26-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

Use timestamp modulo comparison macro for automatic receive buffer
scaling to correctly handle wrapping of ticks value.

MFC after: 1 week


# b7d747ec 18-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug
sysctl's and remove any side effects.

Both sysctl's share the same backend infrastructure and due to the
way it was implemented enabling net.inet.tcp.log_in_vain would also
cause log_debug output to be generated. This was surprising and
eventually annoying to the user.

The log output backend is kept the same but a little shim is inserted
to properly separate log_in_vain and log_debug and to remove any side
effects.

PR: kern/137317
MFC after: 1 week


# 4fc9f6b8 01-Jun-2010 Robert Watson <rwatson@FreeBSD.org>

Merge r204806 from head to stable/8:

Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE()
to match other pcbinfo locking macros.

Approved by: re (bz)


# 480d7c6c 06-May-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r207369:
MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH


# 82cea7e6 29-Apr-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 55f05ae7 17-Apr-2010 Rui Paulo <rpaulo@FreeBSD.org>

MFC r206456:
Honor the CE bit even when the CWR bit is set.

PR: 145600
Submitted by: Richard Scheffenegger <rs at netapp.com>


# 9c251892 09-Apr-2010 Rui Paulo <rpaulo@FreeBSD.org>

Honor the CE bit even when the CWR bit is set.

PR: 145600
Submitted by: Richard Scheffenegger <rs at netapp.com>
MFC after: 1 week


# 66f80e90 06-Mar-2010 Robert Watson <rwatson@FreeBSD.org>

Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE()
to match other pcbinfo locking macros.

MFC after: 1 week


# 3e5cbaa4 09-Oct-2009 Robert Watson <rwatson@FreeBSD.org>

Merge r197814 from head to stable/8:

Remove tcp_input lock statistics; these are intended for debugging only
and are not intended to ship in 8.0 as they dirty additional cache
lines in a performance-critical per-packet path.

Approved by: re (kib, bz)


# f41dd6dc 08-Oct-2009 Robert Watson <rwatson@FreeBSD.org>

Merge r197795 from head to stable/8:

In tcp_input(), we acquire a global write lock at first only if a
segment is likely to trigger a TCP state change (i.e., FIN/RST/SYN).
If we later have to upgrade the lock, we acquire an inpcb reference
and drop both global/inpcb locks before reacquiring in-order. In
that gap, the connection may transition into TIMEWAIT, so we need
to loop back and reevaluate the inpcb after relocking.

Reported by: Kamigishi Rei <spambox at haruhiism.net>
Reviewed by: bz

Approved by: re (kib)


# f681a5fd 06-Oct-2009 Robert Watson <rwatson@FreeBSD.org>

Remove tcp_input lock statistics; these are intended for debugging only
and are not intended to ship in 8.0 as they dirty additional cache
lines in a performance-critical per-packet path.

MFC after: 3 days


# 883e9bc4 05-Oct-2009 Robert Watson <rwatson@FreeBSD.org>

In tcp_input(), we acquire a global write lock at first only if a
segment is likely to trigger a TCP state change (i.e., FIN/RST/SYN).
If we later have to upgrade the lock, we acquire an inpcb reference
and drop both global/inpcb locks before reacquiring in-order. In
that gap, the connection may transition into TIMEWAIT, so we need
to loop back and reevaluate the inpcb after relocking.

MFC after: 3 days
Reported by: Kamigishi Rei <spambox at haruhiism.net>
Reviewed by: bz


# 315e3e38 02-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem. These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored. This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

if_bridge
if_cxgb
if_gif
ip_mroute
ipdivert
pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack. However, that change is too agressive for
this point in the release cycle.

Reviewed by: bz
Approved by: re (kib)


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 7973fba3 28-Jul-2009 Julian Elischer <julian@FreeBSD.org>

Somewhere along the line accept sockets stopped honoring the
FIB selected for them. Fix this.

Reviewed by: ambrisko
Approved by: re (kib)
MFC after: 3 days


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 8c0fec80 23-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:

ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)


# 6dfb8b31 16-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Fix edge cases with ticks wrapping from INT_MAX to INT_MIN in the handling
of the per-tcpcb t_badtrxtwin.

Submitted by: bde


# 1a0e7cfc 11-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Trim extra ()'s.

Submitted by: bde


# 0e8cc7e7 10-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Change a few members of tcpcb that store cached copies of ticks to be ints
instead of unsigned longs. This fixes a few overflow edge cases on 64-bit
platforms. Specifically, if an idle connection receives a packet shortly
before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep
alive timer fires after 2^31 clock ticks, the keep alive timer will think
that the connection has been idle for a very long time and will immediately
drop the connection instead of sending a keep alive probe.

Reviewed by: silby, gnn, lstewart
MFC after: 1 week


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# f93bfb23 02-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from: TrustedBSD Project


# 81ad7eb0 27-May-2009 Zachary Loafman <zml@FreeBSD.org>

Correct handling of SYN packets that are to the left of the current window of an ESTABLISHED connection.

Reviewed by: net@, gnn
Approved by: dfr (mentor)


# 78b50714 11-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after: 3 days


# 80cb9f21 10-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Import "flowid" support for serializing flows across transmit queues

Reviewed by: rwatson and jeli


# ad71fe3c 15-Mar-2009 Robert Watson <rwatson@FreeBSD.org>

Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted. Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

INP_TIMEWAIT
INP_SOCKREF
INP_ONESBCAST
INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after: 1 week (or after dependencies are MFC'd)
Reviewed by: bz


# 24cb0f22 14-Jan-2009 Lawrence Stewart <lstewart@FreeBSD.org>

Add TCP Appropriate Byte Counting (RFC 3465) support to kernel.

The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.

The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.

Reviewed by: rpaulo, gnn
Approved by: gnn, kmacy (mentors)
Sponsored by: FreeBSD Foundation


# dcdb4371 16-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by: rwatson during review [1]
Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks


# d15fb965 09-Dec-2008 Robert Watson <rwatson@FreeBSD.org>

Enhance one comment relating to recent TCP locking changes, and fix a
typo in another.

MFC after: 6 weeks


# 252ca428 08-Dec-2008 Robert Watson <rwatson@FreeBSD.org>

Move from solely write-locking the global tcbinfo in tcp_input()
to read-locking in the TCP input path, allowing greater TCP
input parallelism where multiple ithreads or ithread and netisr
are able to run in parallel. Previously, most TCP input paths
held a write lock on the global tcbinfo lock, effectively
serializing TCP input.

Before looking up the connection, acquire a write lock if a
potentially state-changing flag is set on the TCP segment header
(FIN, RST, SYN), and otherwise a read lock. We may later have
to upgrade to a write lock in certain cases (ACKs received by the
syncache or during TIMEWAIT) in order to support global state
transitions, but this is never required for steady-state packets.

Upgrading from a write lock to a read lock must be done as a
trylock operation to avoid deadlocks, and actually violates the
lock order as the tcbinfo lock preceeds the inpcb lock held at
the time of upgrade. If the trylock fails, we bump the refcount
on the inpcb, drop both locks, and re-acquire in-order. If
another thread has freed the connection while the locks are
dropped, we free the inpcb and repeat the lookup (this should
hardly ever or never happen in practice).

For now, maintain a number of new counters measuring how many
times various cases execute, and in particular whether various
optimistic assumptions about when read locks can be used, whether
upgrades are done using the fast path, and whether connections
close in practice in the above-described race, actually occur.

MFC after: 6 weeks
Discussed with: kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 97021c24 26-Nov-2008 Marko Zec <zec@FreeBSD.org>

Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 44e33a07 19-Nov-2008 Marko Zec <zec@FreeBSD.org>

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 8e5c87f4 06-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix typo and while here another one.

Reviewed by: keramida
Reported by: keramida
MFC after: 2 months (with r184720)


# 91d6cfa6 06-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if were are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which lead to an endless loop between
tcp_output() and tcp_mtudisc() until we overflew the stack.

Reported by: kmacy
MFC after: 2 months (along with r182851)


# 6f01cac6 05-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

In case we return early and got a metricptr to pass the hostcache
info back to the caller we need to initialize the data to a defined
state (zero it) as tcp_hc_get() would do if there was no hit.
Without that the caller would check on random stack garbage which
could lead to undefined results.

This only affected tcp_mss() if there was no routing entry for the peer,
tcp_mtudisc() was not affected.

MFC after: 2 months (along with r182851)


# dd8ac7f9 26-Oct-2008 Robert Watson <rwatson@FreeBSD.org>

In both dropwithreset paths in tcp_input.c, drop the tcbinfo lock
sooner to decomplicate locking and eliminate the need for a rather
chatty comment about why we have to handle the global lock in a
special way for the benefit of ipfw and pf cred rules.

MFC after: 3 days


# 4c95fd23 26-Oct-2008 Robert Watson <rwatson@FreeBSD.org>

Remove endearing but syntactically unnecessary "return;" statements
directly before the final closeing brackets of some TCP functions.

MFC after: 3 days


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 6c8286e4 07-Oct-2008 Robert Watson <rwatson@FreeBSD.org>

Don't pass curthread to sbreserve_locked() in tcp_do_segment(), as the
netisr or ithread's socket buffer size limit is not the right limit to
use. Instead, pass NULL as the other two calls to sbreserve_locked()
in the TCP input path (tcp_mss()) do.

In practice, this is a no-op, as ithreads and the netisr run without a
process limit on socket buffer use, and a NULL thread pointer leads to
not using the process's limit, if any. However, if tcp_input() is
called in other contexts that do have limits, this may prevent the
incorrect limit from being used.

MFC after: 3 days


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 014ea782 25-Sep-2008 Robert Watson <rwatson@FreeBSD.org>

As a follow-on to r183323, correct another case where ip_output() was
called without an inpcb pointer despite holding the tcbinfo global
lock, which lead to a deadlock or panic when ipfw tried to further
acquire it recursively.

Reported by: Stefan Ehmann <shoesoft at gmx dot net>
MFC after: 3 days


# a0ca0871 24-Sep-2008 Robert Watson <rwatson@FreeBSD.org>

When dropping a packet and issuing a reset during TCP segment handling,
unconditionally drop the tcbinfo lock (after all, we assert it lines
before), but call tcp_dropwithreset() under both inpcb and inpcbinfo
locks only if we pass in an tcpcb. Otherwise, if the pointer is NULL,
firewall code may later recurse the global tcbinfo lock trying to look
up an inpcb.

This is an instance where a layering violation leads not only
potentially to code reentrace and recursion, but also to lock
recursion, and was revealed by the conversion to rwlocks because
acquiring a read lock on an rwlock already held with a write lock is
forbidden. When these locks were mutexes, they simply recursed.

Reported by: Stefan Ehmann <shoesoft at gmx dot net>
MFC after: 3 days


# c10eb6d1 09-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Work around an integer division resulting in 0 and thus the
congestion window not being incremented, if cwnd > maxseg^2.
As suggested in RFC2581 increment the cwnd by 1 in this case.

See http://caia.swin.edu.au/reports/080829A/CAIA-TR-080829A.pdf
for more details.

Submitted by: Alana Huebner, Lawrence Stewart,
Grenville Armitage (caia.swin.edu.au)
Reviewed by: dwmalone, gnn, rpaulo
MFC After: 3 days


# 3cee92e0 07-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former
calls the latter.

Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.

This gives us one central place where we calcuate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.

PR: kern/118455
Reviewed by: silby (back in March)
MFC after: 2 months


# ac957cd2 19-Aug-2008 Julian Elischer <julian@FreeBSD.org>

A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# f2512ba1 31-Jul-2008 Rui Paulo <rpaulo@FreeBSD.org>

MFp4 (//depot/projects/tcpecn/):

TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
TCP ECN is defined in RFC 3168.

Partly reviewed by: dwmalone, silby
Obtained from: NetBSD


# 8b07e49a 09-May-2008 Julian Elischer <julian@FreeBSD.org>

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 8501a69c 17-Apr-2008 Robert Watson <rwatson@FreeBSD.org>

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# 7a3244cc 06-Apr-2008 Robert Watson <rwatson@FreeBSD.org>

Add further TCP inpcb locking assertions to some TCP input code paths.

MFC after: 1 month


# c3b02504 02-Mar-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Some "cleanup" of tcp_mss():
- Move the assigment of the socket down before we first need it.
No need to do it at the beginning and then drop out the function
by one of the returns before using it 100 lines further down.
- Use t_maxopd which was assigned the "tcp_mssdflt" for the corrrect
AF already instead of another #ifdef ? : #endif block doing the same.
- Remove an unneeded (duplicate) assignment of mss to t_maxseg just before
we possibly change mss and re-do the assignment without using t_maxseg
in between.

Reviewed by: silby
No objections: net@ (silence)
MFC after: 5 days


# af92e6cf 01-Mar-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix indentation (whitespace changes only).

MFC after: 6 days


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 4b421e2d 07-Oct-2007 Mike Silbersack <silby@FreeBSD.org>

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# e31d8aa3 06-Oct-2007 Mike Silbersack <silby@FreeBSD.org>

Improve the debugging message:

TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received data after socket was closed, sending RST and removing tcpcb

So that it also includes how many bytes of data were received. It now looks
like this:

TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received X bytes of data after socket was closed, sending RST and removing tcpcb

Approved by: re (gnn)


# a2589465 10-Sep-2007 Ken Smith <kensmith@FreeBSD.org>

Make sure that either inp is NULL or we have obtained a lock on it before
jumping to dropunlock to avoid a panic. While here move the calls to
ipsec4_in_reject() and ipsec6_in_reject() so they are after we obtain
the lock on inp.

Original patch to avoid panic: pjd
Review of locking adjustments: gnn, sam
Approved by: re (rwatson)


# 218cbbea 30-Jul-2007 Dag-Erling Smørgrav <des@FreeBSD.org>

Make tcpstates[] static, and make sure TCPSTATES is defined before
<netinet/tcp_fsm.h> is included into any compilation unit that needs
tcpstates[]. Also remove incorrect extern declarations and TCPDEBUG
conditionals. This allows kernels both with and without TCPDEBUG to
build, and unbreaks the tinderbox.

Approved by: re (rwatson)


# 24face54 28-Jul-2007 Matt Jacob <mjacob@FreeBSD.org>

Fix compilation problems- tcpstates is only available if TCPDEBUG
is set.

Approved by: re (in spirit)


# 773673c1 27-Jul-2007 Andre Oppermann <andre@FreeBSD.org>

Provide a sysctl to toggle reporting of TCP debug logging:

sys.net.inet.tcp.log_debug = 1

It defaults to enabled for the moment and is to be turned off for
the next release like other diagnostics from development branches.

It is important to note that sysctl sys.net.inet.tcp.log_in_vain
uses the same logging function as log_debug. Enabling of the former
also causes the latter to engage, but not vice versa.

Use consistent terminology in tcp log messages:

"ignored" means a segment contains invalid flags/information and
is dropped without changing state or issuing a reply.

"rejected" means a segments contains invalid flags/information but
is causing a reply (usually RST) and may cause a state change.

Approved by: re (rwatson)


# 19bc77c5 28-Jul-2007 Andre Oppermann <andre@FreeBSD.org>

o Move all detailed checks for RST in LISTEN state from tcp_input() to
syncache_rst().
o Fix tests for flag combinations of RST and SYN, ACK, FIN. Before
a RST for a connection in syncache did not properly free the entry.
o Add more detailed logging.

Approved by: re (rwatson)


# c325962b 26-Jul-2007 Mike Silbersack <silby@FreeBSD.org>

Export the contents of the syncache to netstat.

Approved by: re (kensmith)
MFC after: 2 weeks


# 564aab1f 25-Jul-2007 Andre Oppermann <andre@FreeBSD.org>

Fix comments in tcp_do_segment().

Approved by: re (kensmith)


# 9fb5d4c0 04-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Fix cast-qualifiers warning when INET6 is not present

Approved by: re (rwatson)


# b2630c29 02-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


# 2cb64cb2 01-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


# f194524f 10-Jun-2007 Andre Oppermann <andre@FreeBSD.org>

Fix a case in tcp_do_segment() where tcp_update_sack_list() would
be called with an incorrect segment end value. tcp_reass() may
trim segments when they overlap with already existing ones in the
reassembly queue. Instead of saving the segment end value before
the call to tcp_reass() compute it on the fly based on the effective
segment length afterwards.

This bug was not really problematic as no information got lost and
the eventual SACK information computation was correct nontheless.

MFC after: 1 week


# e8949f74 10-Jun-2007 Andre Oppermann <andre@FreeBSD.org>

Fix style for comments, be more verbose and add some more.


# 5396d0f8 09-Jun-2007 Andre Oppermann <andre@FreeBSD.org>

Remove some bogosity from the SYN_SENT case in tcp_do_segment
and simplify handling of the send/receive window scaling. No
change in effective behavour.

RFC1323 requires the window field in a SYN (i.e., a <SYN> or
<SYN,ACK>) segment itself never be scaled.

Noticed by: yar


# 8d573cc1 28-May-2007 Andre Oppermann <andre@FreeBSD.org>

Make log messages more verbose and simpler to understand for non-experts.
Update comments to be more conscious, verbose and fully reflect reality.


# e885b205 28-May-2007 Andre Oppermann <andre@FreeBSD.org>

Fix indentation of the syncache_expand() section in tcp_input().


# a160e630 28-May-2007 Andre Oppermann <andre@FreeBSD.org>

Refactor and rewrite in parts the SYN handling code on listen sockets
in tcp_input():

o tighten the checks on allowed TCP flags to be RFC793 and
tcp-secure conform
o log check failures to syslog at LOG_DEBUG level
o rearrange the code flow to be easier to follow
o add KASSERTs to validate assumptions of the code flow

Add sysctl net.inet.tcp.syncache.rst_on_sock_fail defaulting to enable
that controls the behavior on socket creation failure for a otherwise
successful 3-way handshake. The socket creation can fail due to global
memory shortage, listen queue limits and file descriptor limits. The
sysctl allows to chose between two options to deal with this. One is
to send a reset to the other endpoint to notify it about the failure
(default). The other one is to ignore and treat the failure as a
transient error and have the other endpoint retransmit for another try.

Reviewed by: rwatson (in general)


# df541e5f 18-May-2007 Andre Oppermann <andre@FreeBSD.org>

Add tcp_log_addrs() function to generate and standardized TCP log line
for use thoughout the tcp subsystem.

It is IPv4 and IPv6 aware creates a line in the following format:

"TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>"

A "\n" is not included at the end. The caller is supposed to add
further information after the standard tcp log header.

The function returns a NUL terminated string which the caller has
to free(s, M_TCPLOG) after use. All memory allocation is done
with M_NOWAIT and the return value may be NULL in memory shortage
situations.

Either struct in_conninfo || (struct tcphdr && (struct ip || struct
ip6_hdr) have to be supplied.

Due to ip[6].h header inclusion limitations and ordering issues the
struct ip and struct ip6_hdr parameters have to be casted and passed
as void * pointers.

tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr,
void *ip6hdr)

Usage example:

struct ip *ip;
char *tcplog;

if (tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) {
log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n",
tcplog, __func__);
free(s, M_TCPLOG);
}


# 2104448f 16-May-2007 Andre Oppermann <andre@FreeBSD.org>

Move TIME_WAIT related functions and timer handling from files
other than repo copied tcp_subr.c into tcp_timewait.c#1.284:

tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck()

tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset()
tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop()
tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan()

This is a mechanical move with appropriate renames and making
them static if used only locally.

The tcp_tw_2msl_scan() cleanup function is still run from the
tcp_slowtimo() in tcp_timer.c.


# ec9c7553 13-May-2007 Andre Oppermann <andre@FreeBSD.org>

Complete the (mechanical) move of the TCP reassembly and timewait
functions from their origininal place to their own files.

TCP Reassembly from tcp_input.c -> tcp_reass.c
TCP Timewait from tcp_subr.c -> tcp_timewait.c


# f2565d68 10-May-2007 Robert Watson <rwatson@FreeBSD.org>

Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.


# d30d90dc 09-May-2007 Maxim Konovalov <maxim@FreeBSD.org>

o Fix style(9) bugs introduced in the last commit.

Pointed out by: bde


# 10fe523e 09-May-2007 Maxim Konovalov <maxim@FreeBSD.org>

o Unbreak "options TCPDEBUG" && "nooptions INET6" kernel build.

PR: kern/112517
Submitted by: vd


# 3529149e 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool. Change all users accordingly.


# 0ca3f933 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

o Remove redundant tcp reassembly check in header prediction code
o Rearrange code to make intent in TCPS_SYN_SENT case more clear
o Assorted style cleanup
o Comment clarification for tcp_dropwithreset()


# c5ad39b9 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

Reorder the TCP header prediction test to check for the most volatile
values first to spend less time on a fallback to normal processing.


# 679d9708 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

Remove the defunct remains of the TCPS_TIME_WAIT cases from tcp_do_segment
and change it to a void function.

We use a compressed structure for TCPS_TIME_WAIT to save memory. Any late
late segments arriving for such a connection is handled directly in the TW
code.


# 1cd6eadf 04-May-2007 Robert Watson <rwatson@FreeBSD.org>

Tweak comment at end of tcp_input() when calling into tcp_do_segment(): the
pcbinfo lock will be released as well, not just the pcb lock.


# 9fa198be 23-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

o Fix INP lock leak in the minttl case
o Remove indirection in the decision of unlocking inp
o Further annotation of locking in tcp_input()


# df47e437 20-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

o Remove unncessary TOF_SIGLEN flag from struct tcpopt
o Correctly set to->to_signature in tcp_dooptions()
o Update comments


# 7824d002 20-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Add more KASSERT's.


# 4d6e7130 20-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Remove bogus check for accept queue length and associated failure handling
from the incoming SYN handling section of tcp_input().

Enforcement of the accept queue limits is done by sonewconn() after the
3WHS is completed. It is not necessary to have an earlier check before a
connection request enters the SYN cache awaiting the full handshake. It
rather limits the effectiveness of the syncache by preventing legit and
illegit connections from entering it and having them shaken out before we
hit the real limit which may have vanished by then.

Change return value of syncache_add() to void. No status communication
is required.


# e207f800 20-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Simplifly syncache_expand() and clarify its semantics. Zero is returned
when the ACK is invalid and doesn't belong to any registered connection,
either in syncache or through SYN cookies. True but a NULL struct socket
is returned when the 3WHS completed but the socket could not be created
due to insufficient resources or limits reached.

For both cases an RST is sent back in tcp_input().

A logic error leading to a panic is fixed where syncache_expand() would
free the mbuf on socket allocation failure but tcp_input() later supplies
it to tcp_dropwithreset() to issue a RST to the peer.

Reported by: kris (the panic)


# 215c8d75 15-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Remove unused variable tcbinfo_mtx.


# b8152ba7 11-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by: rwatson (earlier version)


# 995a7717 04-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Add INP_INFO_UNLOCK_ASSERT() and use it in tcp_input(). Also add some
further INP_INFO_WLOCK_ASSERT() while there.


# 0c38fd0a 04-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Move last tcpcb initialization for the inbound connection case from
tcp_input() to syncache_socket() where it belongs and the majority
of it already happens.

The "tp->snd_up = tp->snd_una" is removed as it is done with the
tcp_sendseqinit() macro a few lines earlier.


# 5dd9dfef 04-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Retire unused TCP_SACK_DEBUG.


# b728e902 04-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

In tcp_dooptions() skip over SACK options if it is a SYN segment.


# 1929eae1 27-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

When blackholing do a 'dropunlock' in the new world order to prevent the
INP_INFO_LOCK from leaking.

Reported by: ache
Found by: rwatson


# 14739780 24-Mar-2007 Maxim Konovalov <maxim@FreeBSD.org>

o Use a define for a buffer size.

Prodded by: db

o Add missed vars for TCPDEBUG in tcp_do_segment().

Prodded by: tinderbox


# 302ce8d6 23-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Split tcp_input() into its two functional parts:

o tcp_input() now handles TCP segment sanity checks and preparations
including the INPCB lookup and syncache.
o tcp_do_segment() handles all data and ACK processing and is IPv4/v6
agnostic.

Change all KASSERT() messages to ("%s: ", __func__).

The changes in this commit are primarily of mechanical nature and no
functional changes besides the function split are made.

Discussed with: rwatson


# 4dfdffe9 23-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Tidy up some code to conform better to surroundings and style(9), 0 = NULL
and space/tab.


# fc30a251 23-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Bring SACK option handling in tcp_dooptions() in line with all other
options and ajust users accordingly.


# ad3f9ab3 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.


# b10fbdea 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Tidy up IPFIREWALL_FORWARD sections and comments.


# 794235b7 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Update and clarify comments in first section of tcp_input().


# db33b3e6 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Tidy up the ACCEPTCONN section of tcp_input(), ajust comments and remove
old dead T/TCP code.


# 574b6964 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Tidy up tcp_log_in_vain and blackhole.


# 85c49791 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Make TCP_DROP_SYNFIN a standard part of TCP. Disabled by default it
doesn't impede normal operation negatively and is only a few lines of
code. It's close relatives blackhole and log_in_vain aren't options
either.


# e406f5a1 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code. Should
the problem surface in the future we can simply resurrect it from cvs
history.


# 6489fe65 19-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Match up SYSCTL declaration style.


# 02a1a643 15-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Consolidate insertion of TCP options into a segment from within tcp_output()
and syncache_respond() into its own generic function tcp_addoptions().

tcp_addoptions() is alignment agnostic and does optimal packing in all cases.

In struct tcpopt rename to_requested_s_scale to just to_wscale.

Add a comment with quote from RFC1323: "The Window field in a SYN (i.e.,
a <SYN> or <SYN,ACK>) segment itself is never scaled."

Reviewed by: silby, mohans, julian
Sponsored by: TCP/IP Optimization Fundraise 2005


# 95ad8418 07-Mar-2007 Qing Li <qingli@FreeBSD.org>

This patch is provided to fix a couple of deployment issues observed
in the field. In one situation, one end of the TCP connection sends
a back-to-back RST packet, with delayed ack, the last_ack_sent variable
has not been update yet. When tcp_insecure_rst is turned off, the code
treats the RST as invalid because last_ack_sent instead of rcv_nxt is
compared against th_seq. Apparently there is some kind of firewall that
sits in between the two ends and that RST packet is the only RST
packet received. With short lived HTTP connections, the symptom is
a large accumulation of connections over a short period of time .

The +/-(1) factor is to take care of implementations out there that
generate RST packets with these types of sequence numbers. This
behavior has also been observed in live environments.

Reviewed by: silby, Mike Karels
MFC after: 1 week


# 4a32dc29 28-Feb-2007 Mohan Srinivasan <mohans@FreeBSD.org>

In the SYN_SENT case, Initialize the snd_wnd before the call to tcp_mss().
The TCP hostcache logic in tcp_mss() depends on the snd_wnd being initialized.


# 7c72af87 26-Feb-2007 Mohan Srinivasan <mohans@FreeBSD.org>

Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.


# afdb4274 20-Feb-2007 Robert Watson <rwatson@FreeBSD.org>

Rename two identically named log_in_vain variables: tcp_input.c's static
log_in_vain to tcp_log_in_vain, and udp_usrreq's global log_in_vain to
udp_log_in_vain.

MFC after: 1 week


# 6741ecf5 01-Feb-2007 Andre Oppermann <andre@FreeBSD.org>

Auto sizing TCP socket buffers.

Normally the socket buffers are static (either derived from global
defaults or set with setsockopt) and do not adapt to real network
conditions. Two things happen: a) your socket buffers are too small
and you can't reach the full potential of the network between both
hosts; b) your socket buffers are too big and you waste a lot of
kernel memory for data just sitting around.

With automatic TCP send and receive socket buffers we can start with a
small buffer and quickly grow it in parallel with the TCP congestion
window to match real network conditions.

FreeBSD has a default 32K send socket buffer. This supports a maximal
transfer rate of only slightly more than 2Mbit/s on a 100ms RTT
trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send
buffer auto scaling and the default values below it supports 20Mbit/s
at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or
1000%. For the receive side it looks slightly better with a default of
64K buffer size.

New sysctls are:
net.inet.tcp.sendbuf_auto=1 (enabled)
net.inet.tcp.sendbuf_inc=8192 (8K, step size)
net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
net.inet.tcp.recvbuf_auto=1 (enabled)
net.inet.tcp.recvbuf_inc=16384 (16K, step size)
net.inet.tcp.recvbuf_max=262144 (256K, growth limit)

Tested by: many (on HEAD and RELENG_6)
Approved by: re
MFC after: 1 month


# 1d54aa3b 11-Dec-2006 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4: 92972, 98913 + one more change

In ip6_sprintf no longer use and return one of eight static buffers
for printing/logging ipv6 addresses.
The caller now has to hand in a sufficiently large buffer as first
argument.


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# e16fa5ca 25-Sep-2006 John-Mark Gurney <jmg@FreeBSD.org>

fix calculating to_tsecr... This prevents the rtt calculations from
going all wonky...


# f1edc3bd 23-Sep-2006 Bruce M Simpson <bms@FreeBSD.org>

Always set the IP version in the TCP input path, to preserve
the header field for possible later IPSEC SPD lookup, even
when the kernel is built without 'options INET6'.

PR: kern/57760
MFC after: 1 week
Submitted by: Joachim Schueth


# bf6d304a 13-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Rewrite of TCP syncookies to remove locking requirements and to enhance
functionality:

- Remove a rwlock aquisition/release per generated syncookie. Locking
is now integrated with the bucket row locking of syncache itself and
syncookies no longer add any additional lock overhead.
- Syncookie secrets are different for and stored per syncache buck row.
Secrets expire after 16 seconds and are reseeded on-demand.
- The computational overhead for syncookie generation and verification
is one MD5 hash computation as before.
- Syncache can be turned off and run with syncookies only by setting the
sysctl net.inet.tcp.syncookies_only=1.

This implementation extends the orginal idea and first implementation
of FreeBSD by using not only the initial sequence number field to store
information but also the timestamp field if present. This way we can
keep track of the entire state we need to know to recreate the session in
its original form. Almost all TCP speakers implement RFC1323 timestamps
these days. For those that do not we still have to live with the known
shortcomings of the ISN only SYN cookies. The use of the timestamp field
causes the timestamps to be randomized if syncookies are enabled.

The idea of SYN cookies is to encode and include all necessary information
about the connection setup state within the SYN-ACK we send back and thus
to get along without keeping any local state until the ACK to the SYN-ACK
arrives (if ever). Everything we need to know should be available from
the information we encoded in the SYN-ACK.

A detailed description of the inner working of the syncookies mechanism
is included in the comments in tcp_syncache.c.

Reviewed by: silby (slightly earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005


# 751dea29 07-Sep-2006 Ruslan Ermilov <ru@FreeBSD.org>

Back when we had T/TCP support, we used to apply different
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).

Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).


# 233dcce1 06-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

First step of TSO (TCP segmentation offload) support in our network stack.

o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005


# 464469c7 11-Aug-2006 Mohan Srinivasan <mohans@FreeBSD.org>

Fixes an edge case bug in timewait handling where ticks rolling over causing
the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry).
Reviewed by: silby


# 421d8aa6 29-Jun-2006 Bjoern A. Zeeb <bz@FreeBSD.org>

Use INPLOOKUP_WILDCARD instead of just 1 more consistently.

OKed by: rwatson (some weeks ago)


# 8bfb1918 26-Jun-2006 Andre Oppermann <andre@FreeBSD.org>

Some cleanups and janitorial work to tcp_syncache:

o don't assign remote/local host/port information manually between provided
struct in_conninfo and struct syncache, bcopy() it instead
o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture
the purpose of this field
o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons
o fix IPSEC error case printf's to report correct function name
o in syncache_socket() only transpose enhanced tcp options parameters to
struct tcpcb when the inpcb doesn't has TF_NOOPT set
o in syncache_respond() reorder stack variables
o in syncache_respond() remove bogus KASSERT()

No functional changes.

Sponsored by: TCP/IP Optimization Fundraise 2005


# f72167f4 26-Jun-2006 Andre Oppermann <andre@FreeBSD.org>

Some cleanups and janitorial work to tcp_dooptions():

o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its
usage accordingly
o update the comments to the tcp_dooptions() invocation in
tcp_input():after_listen to reflect reality
o move the logic checking the echoed timestamp out of tcp_dooptions() to the
only place that uses it next to the invocation described in the previous
item
o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others
o add comments in to struct tcpopt.to_flags #defines

No functional changes.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 5e1aa279 18-Jun-2006 David Malone <dwmalone@FreeBSD.org>

When we receive an out-of-window SYN for an "ESTABLISHED" connection,
ACK the SYN as required by RFC793, rather than ignoring it. NetBSD
have had a similar change since 1999.

PR: 93236
Submitted by: Grant Edwards <grante@visi.com>
MFC after: 1 month


# 351630c4 17-Jun-2006 Andre Oppermann <andre@FreeBSD.org>

Add locking to TCP syncache and drop the global tcpinfo lock as early
as possible for the syncache_add() case. The syncache timer no longer
aquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.

On a P4 the additional locks cause a slight degression of 0.7% in tcp
connections per second. When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it. However this would be a more
involved change from the way syncookies work at the moment.

Reviewed by: rwatson
Tested by: rwatson, ps (earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005


# 4f590175 21-Apr-2006 Paul Saab <ps@FreeBSD.org>

Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.


# 3cbe7faf 09-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Modify tcp_timewait() to accept an inpcb reference, not a tcptw
reference. For now, we allow the possibility that the in_ppcb
pointer in the inpcb may be NULL if a timewait socket has had its
tcptw structure recycled. This allows tcp_timewait() to
consistently unlock the inpcb.

Reported by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


# a460ae4b 05-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Don't unlock a timewait structure if the pointer is NULL in
tcp_timewait(). This corrects a bug (or lack of fixing of a bug)
in tcp_input.c:1.295.

Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


# ae0e7143 03-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being
NULL. We currently do allow this to happen, but may want to remove that
possibility in the future. This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled. This will
be cleaned up in the future.

Found by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


# afa39e25 03-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed. Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.

MFC after: 3 months


# 623dce13 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.

- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.

- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after: 3 months


# 1c53f806 25-Mar-2006 Robert Watson <rwatson@FreeBSD.org>

Explicitly assert socket pointer is non-NULL in tcp_input() so as to
provide better debugging information.

Prefer explicit comparison to NULL for tcpcb pointers rather than
treating them as booleans.

MFC after: 1 month


# 464fcfbc 28-Feb-2006 Andre Oppermann <andre@FreeBSD.org>

Rework TCP window scaling (RFC1323) to properly scale the send window
right from the beginning and partly clean up the differences in handling
between SYN_SENT and SYN_RCVD (syncache).

Further changes to this code to come. This is a first incremental step
to a general overhaul and streamlining of the TCP code.

PR: kern/15095
PR: kern/92690 (partly)
Reviewed by: qingli (and tested with ANVL)
Sponsored by: TCP/IP Optimization Fundraise 2005


# 4b8e98d6 23-Feb-2006 Qing Li <qingli@FreeBSD.org>

This patch fixes the problem where the current TCP code can not handle
simultaneous open. Both the bug and the patch were verified using the
ANVL test suite.

PR: kern/74935
Submitted by: qingli (before I became committer)
Reviewed by: andre
MFC after: 5 days


# 8e8aab7a 18-Feb-2006 Andre Oppermann <andre@FreeBSD.org>

Remove unneeded includes and provide more accurate description
to others.

Submitted by: garys
PR: kern/86437


# eaf80179 16-Feb-2006 Andre Oppermann <andre@FreeBSD.org>

Have TCP Inflight disable itself if the RTT is below a certain
threshold. Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.

The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage. It defaults
to 10ms.

Tested by: Joao Barros <joao.barros-at-gmail.com>,
Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by: TCP/IP Optimization Fundraise 2005


# 02707462 18-Jan-2006 Andre Oppermann <andre@FreeBSD.org>

Do not derefence the ip header pointer in the IPv6 case.
This fixes a bug in the previous commit.

Found by: Coverity Prevent(tm)
Coverity ID: CID253
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 34f83c52 14-Jan-2006 George V. Neville-Neil <gnn@FreeBSD.org>

Check the correct TTL in both the IPv6 and IPv4 cases.

Submitted by: glebius
Reviewed by: gnn, bz
Found with: Coverity Prevent(tm)


# ef39adf0 18-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
ip_dooptions(struct mbuf *m, int pass)
save_rte(m, option, dst)
ip_srcroute(m0)
ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
ip_insertoptions(m, opt, phlen)
ip_optcopy(ip, jp)
ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


# a65e12b0 19-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Convert if (tp->t_state == TCPS_LISTEN) panic() into a KASSERT.

MFC after: 2 weeks


# 4d3b1346 23-Aug-2005 Paul Saab <ps@FreeBSD.org>

Remove a KASSERT in the sack path that fails because of a interaction
between sack and a bug in the "bad retransmit recovery" logic. This is
a workaround, the underlying bug will be fixed later.

Submitted by: Mohan Srinivasan, Noritoshi Demizu


# 936cd18d 22-Aug-2005 Andre Oppermann <andre@FreeBSD.org>

Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket. All packets with a
lower TTL are silently dropped. Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682. Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005


# d7587117 05-Jul-2005 Paul Saab <ps@FreeBSD.org>

Fix for a bug in newreno partial ack handling where if a large amount
of data is partial acked, snd_cwnd underflows, causing a burst.

Found, Submitted by: Noritoshi Demizu
Approved by: re


# 482ac968 01-Jul-2005 Paul Saab <ps@FreeBSD.org>

Fix for a bug in the change that defers sack option processing until
after PAWS checks. The symptom of this is an inconsistency in the cached
sack state, caused by the fact that the sack scoreboard was not being
updated for an ACK handled in the header prediction path.

Found by: Andrey Chernov.
Submitted by: Noritoshi Demizu, Raja Mukerji.
Approved by: re


# 69e03620 01-Jul-2005 Paul Saab <ps@FreeBSD.org>

Fix for a SACK crash caused by a bug in tcp_reass(). tcp_reass()
does not clear tlen and frees the mbuf (leaving th pointing at
freed memory), if the data segment is a complete duplicate.
This change works around that bug. A fix for the tcp_reass() bug
will appear later (that bug is benign for now, as neither th nor
tlen is referenced in tcp_input() after the call to tcp_reass()).

Found by: Pawel Jakub Dawidek.
Submitted by: Raja Mukerji, Noritoshi Demizu.
Approved by: re


# 0a389eab 29-Jun-2005 Simon L. B. Nielsen <simon@FreeBSD.org>

Fix ipfw packet matching errors with address tables.

The ipfw tables lookup code caches the result of the last query. The
kernel may process multiple packets concurrently, performing several
concurrent table lookups. Due to an insufficient locking, a cached
result can become corrupted that could cause some addresses to be
incorrectly matched against a lookup table.

Submitted by: ru
Reviewed by: csjp, mlaier
Security: CAN-2005-2019
Security: FreeBSD-SA-05:13.ipfw

Correct bzip2 permission race condition vulnerability.

Obtained from: Steve Grubb via RedHat
Security: CAN-2005-0953
Security: FreeBSD-SA-05:14.bzip2
Approved by: obrien

Correct TCP connection stall denial of service vulnerability.

A TCP packets with the SYN flag set is accepted for established
connections, allowing an attacker to overwrite certain TCP options.

Submitted by: Noritoshi Demizu
Reviewed by: andre, Mohan Srinivasan
Security: CAN-2005-2068
Security: FreeBSD-SA-05:15.tcp

Approved by: re (security blanket), cperciva


# 5a53ca16 27-Jun-2005 Paul Saab <ps@FreeBSD.org>

- Postpone SACK option processing until after PAWS checks. SACK option
processing is now done in the ACK processing case.
- Merge tcp_sack_option() and tcp_del_sackholes() into a new function
called tcp_sack_doack().
- Test (SEG.ACK < SND.MAX) before processing the ACK.

Submitted by: Noritoshi Demizu
Reveiewed by: Mohan Srinivasan, Raja Mukerji
Approved by: re


# 68d37625 25-Jun-2005 Stephan Uphoff <ups@FreeBSD.org>

Fix a timer ticks wrap around bug for minmssoverload processing.

Approved by: re (scottl,dwhite)
MFC after: 4 weeks


# 1e2d989d 31-May-2005 Robert Watson <rwatson@FreeBSD.org>

Assert that tcbinfo is locked in tcp_input() before calling into
tcp_drop().

MFC after: 7 days


# 416738a7 01-Jun-2005 Robert Watson <rwatson@FreeBSD.org>

Assert the tcbinfo lock whenever tcp_close() is to be called by
tcp_input().

MFC after: 7 days


# 808f11b7 25-May-2005 Paul Saab <ps@FreeBSD.org>

This is conform with the terminology in

M.Mathis and J.Mahdavi,
"Forward Acknowledgement: Refining TCP Congestion Control"
SIGCOMM'96, August 1996.

Submitted by: Noritoshi Demizu, Raja Mukerji


# 0077b016 11-May-2005 Paul Saab <ps@FreeBSD.org>

When looking for the next hole to retransmit from the scoreboard,
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.

This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.

The debug code that sanity checks the hints against the computed
value will be removed eventually.

Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.


# 25e6f9ed 14-Apr-2005 Paul Saab <ps@FreeBSD.org>

Fix for a TCP SACK bug where more than (win/2) bytes could have been
in flight in SACK recovery.

Found by: Noritoshi Demizu
Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com>
Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>
Raja Mukerji <raja at moselle dot com>


# cf09195b 09-Apr-2005 Paul Saab <ps@FreeBSD.org>

- Tighten up the Timestamp checks to prevent a spoofed segment from
setting ts_recent to an arbitrary value, stopping further
communication between the two hosts.
- If the Echoed Timestamp is greater than the current time,
fall back to the non RFC 1323 RTT calculation.

Submitted by: Raja Mukerji (raja at moselle dot com)
Reviewed by: Noritoshi Demizu, Mohan Srinivasan


# e346eeff 09-Apr-2005 Paul Saab <ps@FreeBSD.org>

- If the reassembly queue limit was reached or if we couldn't allocate
a reassembly queue state structure, don't update (receiver) sack
report.
- Similarly, if tcp_drain() is called, freeing up all items on the
reassembly queue, clean the sack report.

Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).


# 7643c37c 17-Feb-2005 Paul Saab <ps@FreeBSD.org>

Remove 2 (SACK) fields from the tcpcb. These are only used by a
function that is called from tcp_input(), so they oughta be passed on
the stack instead of stuck in the tcpcb.

Submitted by: Mohan Srinivasan


# 7776346f 15-Feb-2005 Paul Saab <ps@FreeBSD.org>

Fix for a SACK (receiver) bug where incorrect SACK blocks are
reported to the sender - in the case where the sender sends data
outside the window (as WinXP does :().

Reported by: Sam Jensen <sam at wand dot net dot nz>
Submitted by: Mohan Srinivasan


# 8db456bf 14-Feb-2005 Paul Saab <ps@FreeBSD.org>

- Retransmit just one segment on initiation of SACK recovery.
Remove the SACK "initburst" sysctl.
- Fix bugs in SACK dupack and partialack handling that can cause
large bursts while in SACK recovery.

Submitted by: Mohan Srinivasan


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# a69968ee 03-Jan-2005 Mike Silbersack <silby@FreeBSD.org>

Add a sysctl (net.inet.tcp.insecure_rst) which allows one to specify
that the RFC 793 specification for accepting RST packets should be
following. When followed, this makes one vulnerable to the attacks
described in "slipping in the window", but it may be necessary in
some odd circumstances.


# 42cf3289 25-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

In the dropafterack case of tcp_input(), it's OK to release the TCP
pcbinfo lock before calling tcp_output(), as holding just the inpcb
lock is sufficient to prevent garbage collection.


# e0bef1cb 25-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Revert parts of tcp_input.c:1.255 associated with the header predicted
cases for tcp_input():

While it is true that the pcbinfo lock provides a pseudo-reference to
inpcbs, both the inpcb and pcbinfo locks are required to free an
un-referenced inpcb. As such, we can release the pcbinfo lock as
long as the inpcb remains locked with the confidence that it will not
be garbage-collected. This leads to a less conservative locking
strategy that should reduce contention on the TCP pcbinfo lock.

Discussed with: sam


# 2be3bf22 28-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Assert the inpcb lock in tcp_xmit_timer() as it performs read-modify-
write of various time/rtt-related fields in the tcpcb.


# 18ad5842 28-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Expand coverage of the receive socket buffer lock when handling urgent
pointer updates: test available space while holding the socket buffer
mutex, and continue to hold until until the pointer update has been
performed.

MFC after: 2 weeks


# 6a220ed8 25-Nov-2004 Mike Silbersack <silby@FreeBSD.org>

Fix a problem where our TCP stack would ignore RST packets if the receive
window was 0 bytes in size. This may have been the cause of unsolved
"connection not closing" reports over the years.

Thanks to Michiel Boland for providing the fix and providing a concise
test program for the problem.

Submitted by: Michiel Boland
MFC after: 2 weeks


# de30ea13 23-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the
contents of the tcpcb are read and modified in volume.

In tcp_input(), replace th comparison with 0 with a comparison with
NULL.

At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in
tcp_input(), assert 'headlocked'. Try to improve consistency between
various assertions regarding headlocked to be more informative.

MFC after: 2 weeks


# cce83ffb 23-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it. Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after: 2 weeks


# ca127a3e 22-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Remove "Unlocked read" annotations associated with previously unlocked
use of socket buffer fields in the TCP input code. These references
are now protected by use of the receive socket buffer lock.

MFC after: 1 week


# d6915262 07-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Do some re-sorting of TCP pcbinfo locking and assertions: make sure to
retain the pcbinfo lock until we're done using a pcb in the in-bound
path, as the pcbinfo lock acts as a pseuo-reference to prevent the pcb
from potentially being recycled. Clean up assertions and make sure to
assert that the pcbinfo is locked at the head of code subsections where
it is needed. Free the mbuf at the end of tcp_input after releasing
any held locks to reduce the time the locks are held.

MFC after: 3 weeks


# c94c54e4 02-Nov-2004 Andre Oppermann <andre@FreeBSD.org>

Remove RFC1644 T/TCP support from the TCP side of the network stack.

A complete rationale and discussion is given in this message
and the resulting discussion:

http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on: -arch


# a55db2b6 05-Oct-2004 Paul Saab <ps@FreeBSD.org>

- Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.

Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>


# 9b932e9e 17-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland
and preserves the ipfw ABI. The ipfw core packet inspection and filtering
functions have not been changed, only how ipfw is invoked is different.

However there are many changes how ipfw is and its add-on's are handled:

In general ipfw is now called through the PFIL_HOOKS and most associated
magic, that was in ip_input() or ip_output() previously, is now done in
ipfw_check_[in|out]() in the ipfw PFIL handler.

IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to
be diverted is checked if it is fragmented, if yes, ip_reass() gets in for
reassembly. If not, or all fragments arrived and the packet is complete,
divert_packet is called directly. For 'tee' no reassembly attempt is made
and a copy of the packet is sent to the divert socket unmodified. The
original packet continues its way through ip_input/output().

ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet
with the new destination sockaddr_in. A check if the new destination is a
local IP address is made and the m_flags are set appropriately. ip_input()
and ip_output() have some more work to do here. For ip_input() the m_flags
are checked and a packet for us is directly sent to the 'ours' section for
further processing. Destination changes on the input path are only tagged
and the 'srcrt' flag to ip_forward() is set to disable destination checks
and ICMP replies at this stage. The tag is going to be handled on output.
ip_output() again checks for m_flags and the 'ours' tag. If found, the
packet will be dropped back to the IP netisr where it is going to be picked
up by ip_input() again and the directly sent to the 'ours' section. When
only the destination changes, the route's 'dst' is overwritten with the
new destination from the forward m_tag. Then it jumps back at the route
lookup again and skips the firewall check because it has been marked with
M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with
'option IPFIREWALL_FORWARD' to enable it.

DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for
a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will
then inject it back into ip_input/ip_output() after it has served its time.
Dummynet packets are tagged and will continue from the next rule when they
hit the ipfw PFIL handlers again after re-injection.

BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as
they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS.

More detailed changes to the code:

conf/files
Add netinet/ip_fw_pfil.c.

conf/options
Add IPFIREWALL_FORWARD option.

modules/ipfw/Makefile
Add ip_fw_pfil.c.

net/bridge.c
Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw
is still directly invoked to handle layer2 headers and packets would
get a double ipfw when run through PFIL_HOOKS as well.

netinet/ip_divert.c
Removed divert_clone() function. It is no longer used.

netinet/ip_dummynet.[ch]
Neither the route 'ro' nor the destination 'dst' need to be stored
while in dummynet transit. Structure members and associated macros
are removed.

netinet/ip_fastfwd.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.

netinet/ip_fw.h
Removed 'ro' and 'dst' from struct ip_fw_args.

netinet/ip_fw2.c
(Re)moved some global variables and the module handling.

netinet/ip_fw_pfil.c
New file containing the ipfw PFIL handlers and module initialization.

netinet/ip_input.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code. ip_forward() does not longer require
the 'next_hop' struct sockaddr_in argument. Disable early checks
if 'srcrt' is set.

netinet/ip_output.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.

netinet/ip_var.h
Add ip_reass() as general function. (Used from ipfw PFIL handlers
for IPDIVERT.)

netinet/raw_ip.c
Directly check if ipfw and dummynet control pointers are active.

netinet/tcp_input.c
Rework the 'ipfw forward' to local code to work with the new way of
forward tags.

netinet/tcp_sack.c
Remove include 'opt_ipfw.h' which is not needed here.

sys/mbuf.h
Remove m_claim_next() macro which was exclusively for ipfw 'forward'
and is no longer needed.

Approved by: re (scottl)


# a4f757cd 16-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>


# 7cfc6904 12-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

After each label in tcp_input(), assert the inpcbinfo and inpcb lock
state that we expect.


# a0445c2e 01-Jul-2004 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

On receiving 3 duplicate acknowledgements, SACK recovery was not being entered correctly.
Fix this problem by separating out the SACK and the newreno cases. Also, check
if we are in FASTRECOVERY for the sack case and if so, turn off dupacks.

Fix an issue where the congestion window was not being incremented by ssthresh.

Thanks to Mohan Srinivasan for finding this problem.


# 1e4d7da7 26-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
acquire the socket buffer lock and use the _locked() variants of both
calls. Note that the _locked() sowakeup() versions unlock the mutex on
return. This is done in uipc_send(), divert_packet(), mroute
socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
explicit unlock and use the _locked() sowakeup() variant. This is done
in soisdisconnecting(), soisdisconnected() when setting the can't send/
receive flags and dropping data, and in uipc_rcvd() which adjusting
back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.


# 652178a1 24-Jun-2004 Paul Saab <ps@FreeBSD.org>

White space & spelling fixes

Submitted by: Xin LI <delphij@frontfree.net>


# 5905999b 23-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Broaden scope of the socket buffer lock when processing an ACK so that
the read and write of sb_cc are atomic. Call sbdrop_locked() instead
of sbdrop() since we already hold the socket buffer lock.


# 927c5cea 23-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Protect so_oobmark with with SOCKBUF_LOCK(&so->so_rcv), and broaden
locking in tcp_input() for TCP packets with urgent data pointers to
hold the socket buffer lock across testing and updating oobmark
from just protecting sb_state.

Update socket locking annotations


# 3f11a2f3 23-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted. sbreserve() now acquires
the lock before calling sbreserve_locked(). In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order. In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.


# 6d90faf3 23-Jun-2004 Paul Saab <ps@FreeBSD.org>

Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)


# 1f82efb3 20-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Assert the inpcb lock before letting MAC check whether we can deliver
to the inpcb in tcp_input().


# d420fcda 16-Jun-2004 Bruce M Simpson <bms@FreeBSD.org>

Fix build for IPSEC && !INET6

PR: kern/66125
Submitted by: Cyrille Lefevre


# 7721f5d7 14-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field. This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.


# c0b99ffa 14-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.


# 310e7ceb 12-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.


# 2f3f1e67 02-May-2004 Darren Reed <darrenr@FreeBSD.org>

Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier.


# 7fbb1300 02-May-2004 Darren Reed <darrenr@FreeBSD.org>

oops, I forgot this file in a prior commit (change was still sitting here,
uncommitted):

Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h
alongside all of the other macros that work ok mbuf's and tag's.


# 80dd2a81 25-Apr-2004 Mike Silbersack <silby@FreeBSD.org>

Tighten up reset handling in order to make reset attacks as difficult as
possible while maintaining compatibility with the widest range of TCP stacks.

The algorithm is as follows:

---
For connections in the ESTABLISHED state, only resets with
sequence numbers exactly matching last_ack_sent will cause a reset,
all other segments will be silently dropped.

For connections in all other states, a reset anywhere in the window
will cause the connection to be reset. All other segments will be
silently dropped.
---

The necessity of accepting all in-window resets was discovered
by jayanth and jlemon, both of whom have seen TCP stacks that
will respond to FIN-ACK packets with resets not meeting the
strict last_ack_sent check.

Idea by: Darren Reed
Reviewed by: truckman, jlemon, others(?)


# 2d166c02 23-Apr-2004 Andre Oppermann <andre@FreeBSD.org>

Correct an edge case in tcp_mss() where the cached path MTU
from tcp_hostcache would have overridden a (now) lower MTU of
an interface or route that changed since first PMTU discovery.
The bug would have caused TCP to redo the PMTU discovery when
not strictly necessary.

Make a comment about already pre-initialized default values
more clear.

Reviewed by: sam


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 04d3a452 01-Mar-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

fix -O0 compilation without INET6.

Pointed out by: ru


# a7b6a14a 28-Feb-2004 Robert Watson <rwatson@FreeBSD.org>

Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These
were needed by the MAC Framework until inpcbs gained labels.

Submitted by: sam


# ac9d7e26 25-Feb-2004 Max Laier <mlaier@FreeBSD.org>

Re-remove MT_TAGs. The problems with dummynet have been fixed now.

Tested by: -current, bms(mentor), me
Approved by: bms(mentor), sam


# 89c02376 25-Feb-2004 Jeffrey Hsu <hsu@FreeBSD.org>

Relax a KASSERT condition to allow for a valid corner case where
the FIN on the last segment consumes an extra sequence number.

Spurious panic reported by Mike Silbersack <silby@silby.com>.


# 12e2e970 24-Feb-2004 Andre Oppermann <andre@FreeBSD.org>

Convert the tcp segment reassembly queue to UMA and limit the maximum
amount of segments it will hold.

The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:

net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).

net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).

net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.

net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.

Tested by: bms, silby
Reviewed by: bms, silby


# 36e8826f 17-Feb-2004 Max Laier <mlaier@FreeBSD.org>

Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is
not working properly with the patch in place.

Approved by: bms(mentor)


# da0f4099 17-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

IPSEC and FAST_IPSEC have the same internal API now;
so merge these (IPSEC has an extra ipsecstat)

Submitted by: "Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>


# 1094bdca 13-Feb-2004 Max Laier <mlaier@FreeBSD.org>

This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).

This is (mostly) work from: sam

Silence from: -arch
Approved by: bms(mentor), sam, rwatson


# 265ed012 13-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Brucification.

Submitted by: bde


# a0194ef1 12-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Remove an unnecessary initialization that crept in from the code which
verifies TCP-MD5 digests.

Noticed by: njl


# 1cfd4b53 10-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


# f073c60f 03-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

pass pcb rather than so. it is expected that per socket policy
works again.


# 61a36e3d 20-Jan-2004 Jeffrey Hsu <hsu@FreeBSD.org>

Merge from DragonFlyBSD rev 1.10:

date: 2003/09/02 10:04:47; author: hsu; state: Exp; lines: +5 -6
Account for when Limited Transmit is not congestion window limited.

Obtained from: DragonFlyBSD


# 53369ac9 08-Jan-2004 Andre Oppermann <andre@FreeBSD.org>

Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.

For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limites of network components along the path.

This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).

We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.

o the local host is reveiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may chose to do what is described in the first attack and
send the data in packets with an TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.

For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).

This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.

We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is resetted and dropped.

MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day


# dba7bc6a 06-Jan-2004 Andre Oppermann <andre@FreeBSD.org>

Enable the following TCP options by default to give it more exposure:

rfc3042 Limited retransmit
rfc3390 Increasing TCP's initial congestion Window
inflight TCP inflight bandwidth limiting

All my production server have it enabled and there have been no
issues. I am confident about having them on by default and it gives
us better overall TCP performance.

Reviewed by: sam (mentor)


# 943ae302 25-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Restructure a too broad ifdef which was disabling the setting of the
tcp flightsize sysctl value for local networks in the !INET6 case.

Approved by: re (scottl)


# 97d8d152 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# a557af22 17-Nov-2003 Robert Watson <rwatson@FreeBSD.org>

Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 122aad88 12-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

dropwithreset is not needed in this case as tcp_drop() is already notifying
the other side. Before we were sending two RST packets.


# c29afad6 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

o correct locking problem: the inpcb must be held across tcp_respond
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond

Supported by: FreeBSD Foundation


# 395bb186 27-Oct-2003 Sam Leffler <sam@FreeBSD.org>

speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by: ps/jayanth
Obtained from: netbsd (jason thorpe)


# b3399803 20-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

enclose IPv6 part with ifdef INET6.

Obtained from: KAME


# 31b3783c 20-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

correct linkmtu handling.

Obtained from: KAME


# 31b1bfe1 17-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- add dom_if{attach,detach} framework.
- transition to use ifp->if_afdata.

Obtained from: KAME


# 3c653157 13-Aug-2003 Hartmut Brandt <harti@FreeBSD.org>

A number of patches in the last years have created new return paths
in tcp_input that leave the function before hitting the tcp_trace
function call for the TCPDEBUG option. This has made TCPDEBUG mostly
useless (and tools like ports/benchmarks/dbs not working). Add
tcp_trace calls to the return paths that could be identified in this
maze.

This is a NOP unless you compile with TCPDEBUG.


# 9d11646d 15-Jul-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Unify the "send high" and "recover" variables as specified in the
lastest rev of the spec. Use an explicit flag for Fast Recovery. [1]

Fix bug with exiting Fast Recovery on a retransmit timeout
diagnosed by Lu Guohan. [2]

Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com>
Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2]
Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>,
Sally Floyd <floyd@acm.org> [1]


# e4d2978d 31-May-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add /* FALLTHROUGH */

Found by: FlexeLint


# 430c6354 06-May-2003 Robert Watson <rwatson@FreeBSD.org>

Correct a bug introduced with reduced TCP state handling; make
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.

Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.

Approved by: re (scottl)
Reviewed by: hsu
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 152385d1 21-Apr-2003 David E. O'Brien <obrien@FreeBSD.org>

Explicitly declare 'int' parameters.


# 48d2549c 01-Apr-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Observe conservation of packets when entering Fast Recovery while
doing Limited Transmit. Only artificially inflate the congestion
window by 1 segment instead of the usual 3 to take into account
the 2 already sent by Limited Transmit.

Approved in principle by: Mark Allman <mallman@grc.nasa.gov>,
Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>


# 7792ea27 13-Mar-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Greatly simplify the unlocking logic by holding the TCP protocol lock until
after FIN_WAIT_2 processing.

Helped with debugging: Doug Barton


# da3a8a1a 12-Mar-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Add support for RFC 3390, which allows for a variable-sized
initial congestion window.


# 582a954b 12-Mar-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Implement the Limited Transmit algorithm (RFC 3042).


# 607b0b0c 08-Mar-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Remove a panic(); if the zone allocator can't provide more timewait
structures, reuse the oldest one. Also move the expiry timer from
a per-structure callout to the tcp slow timer.

Sponsored by: DARPA, NAI Labs


# 272c5dfe 26-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

In timewait state, if the incoming segment is a pure in-sequence ack
that matches snd_max, then do not respond with an ack, just drop the
segment. This fixes a problem where a simultaneous close results in
an ack loop between two time-wait states.

Test case supplied by: Tim Robbins <tjr@FreeBSD.ORG>
Sponsored by: DARPA, NAI Labs


# ef6b48de 26-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

The TCP protocol lock may still be held if the reassembly queue dropped FIN.
Detect this case and drop the lock accordingly.

Sponsored by: DARPA, NAI Labs


# 11a20fb8 23-Feb-2003 Jeffrey Hsu <hsu@FreeBSD.org>

tcp_twstart() need to be called with the TCP protocol lock held to avoid
a race condition with the TCP timer routines.


# 2fbef918 23-Feb-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Pass the right function to callout_reset() for a compressed
TIME-WAIT control block.


# f243998b 23-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Yesterday just wasn't my day. Remove testing delta that crept into the diff.

Pointy hat provided by: sam


# a14c749f 22-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Check to see if the TF_DELACK flag is set before returning from
tcp_input(). This unbreaks delack handling, while still preserving
correct T/TCP behavior

Tested by: maxim
Sponsored by: DARPA, NAI Labs


# 340c35de 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs


# 41446225 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Correct comments.


# 3bfd6421 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Clean up delayed acks and T/TCP interactions:
- delay acks for T/TCP regardless of delack setting
- fix bug where a single pass through tcp_input might not delay acks
- use callout_active() instead of callout_pending()

Sponsored by: DARPA, NAI Labs


# 85e8b243 13-Feb-2003 Jeffrey Hsu <hsu@FreeBSD.org>

The protocol lock is always held in the dropafterack case, so we don't
need to check for it at runtime.


# 39eb27a4 02-Feb-2003 Crist J. Clark <cjc@FreeBSD.org>

Add the TCP flags to the log message whenever log_in_vain is 1, not
just when set to 2.

PR: kern/43348
MFC after: 5 days


# cb942153 13-Jan-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Fix NewReno.

Reviewed by: Tom Henderson <thomas.r.henderson@boeing.com>


# 07fd333d 30-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

Remove the PAWS ack-on-ack debugging printf().

Note that the original RFC 1323 (PAWS) says in 4.2.1 that the out of
order / reverse-time-indexed packet should be acknowledged as specified
in RFC-793 page 69 then dropped. The original PAWS code in FreeBSD (1994)
simply acknowledged the segment unconditionally, which is incorrect, and
was fixed in 1.183 (2002). At the moment we do not do checks for SYN or FIN
in addition to (tlen != 0), which may or may not be correct, but the
worst that ought to happen should be a retry by the sender.


# 540e8b7e 20-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Unravel a nested conditional.
Remove an unneeded local variable.


# 967adce8 16-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

Fix syntax in last commit.


# 1ab4789d 14-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

Bruce forwarded this tidbit from an analysis Van Jacobson did on an
apparent ack-on-ack problem with FreeBSD. Prof. Jacobson noticed a
case in our TCP stack which would acknowledge a received ack-only packet,
which is not legal in TCP.

Submitted by: Van Jacobson <van@packetdesign.com>,
bmah@packetdesign.com (Bruce A. Mah)
MFC after: 7 days


# 6f0d017c 10-Nov-2002 Sam Leffler <sam@FreeBSD.org>

a better solution to building FAST_IPSEC w/o INET6

Submitted by: Jeffrey Hsu <hsu@FreeBSD.org>


# 58fcadfc 08-Nov-2002 Sam Leffler <sam@FreeBSD.org>

fixup FAST_IPSEC build w/o INET6


# 1645d090 31-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Consistently update snd_wl1, snd_wl2, and rcv_up in the header
prediction code. Previously, 2GB worth of header predicted data
could leave these variables too far out of sequence which would cause
problems after receiving a packet that did not match the header
prediction.

Submitted by: Bill Baumann <bbaumann@isilon.com>
Sponsored by: Isilon Systems, Inc.
Reviewed by: hsu, pete@isilon.com, neal@isilon.com, aaronp@isilon.com


# 30613f56 30-Oct-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Don't need to check if SO_OOBINLINE is defined.
Don't need to protect isipv6 conditional with INET6.
Fix leading indentation in 2 lines.


# b9234faf 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 5d846453 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# a84db8f4 30-Sep-2002 Matthew Dillon <dillon@FreeBSD.org>

Guido found another bug. There is a situation with
timestamped TCP packets where FreeBSD will send DATA+FIN and
A W2K box will ack just the DATA portion. If this occurs
after FreeBSD has done a (NewReno) fast-retransmit and is
recovering it (dupacks > threshold) it triggers a case in
tcp_newreno_partial_ack() (tcp_newreno() in stable) where
tcp_output() is called with the expectation that the retransmit
timer will be reloaded. But tcp_output() falls through and
returns without doing anything, causing the persist timer to be
loaded instead. This causes the connection to hang until W2K gives up.
This occurs because in the case where only the FIN must be acked, the
'len' calculation in tcp_output() will be 0, a lot of checks will be
skipped, and the FIN check will also be skipped because it is designed
to handle FIN retransmits, not forced transmits from tcp_newreno().

The solution is to simply set TF_ACKNOW before calling tcp_output()
to absolute guarentee that it will run the send code and reset the
retransmit timer. TF_ACKNOW is already used for this purpose in other
cases.

For some unknown reason this patch also seems to greatly reduce
the number of duplicate acks received when Guido runs his tests over
a lossy network. It is quite possible that there are other
tcp_newreno{_partial_ack()} cases which were not generating the expected
output which this patch also fixes.

X-MFC after: Will be MFC'd after the freeze is over


# c1c36a2c 21-Sep-2002 Mike Silbersack <silby@FreeBSD.org>

Fix issue where shutdown(socket, SHUT_RD) was effectively
ignored for TCP sockets.

NetBSD PR: 18185
Submitted by: Sean Boudreau <seanb@qnx.com>
MFC after: 3 days


# fa55172b 17-Sep-2002 Matthew Dillon <dillon@FreeBSD.org>

Guido reported an interesting bug where an FTP connection between a
Windows 2000 box and a FreeBSD box could stall. The problem turned out
to be a timestamp reply bug in the W2K TCP stack. FreeBSD sends a
timestamp with the SYN, W2K returns a timestamp of 0 in the SYN+ACK
causing FreeBSD to calculate an insane SRTT and RTT, resulting in
a maximal retransmit timeout (60 seconds). If there is any packet
loss on the connection for the first six or so packets the retransmit
case may be hit (the window will still be too small for fast-retransmit),
causing a 60+ second pause. The W2K box gives up and closes the
connection.

This commit works around the W2K bug.

15:04:59.374588 FREEBSD.20 > W2K.1036: S 1420807004:1420807004(0) win 65535 <mss 1460,nop,wscale 2,nop,nop,timestamp 188297344 0> (DF) [tos 0x8]
15:04:59.377558 W2K.1036 > FREEBSD.20: S 4134611565:4134611565(0) ack 1420807005 win 17520 <mss 1460,nop,wscale 0,nop,nop,timestamp 0 0> (DF)

Bug reported by: Guido van Rooij <guido@gvr.org>


# 93b0017f 25-Aug-2002 Philippe Charnier <charnier@FreeBSD.org>

Replace various spelling with FALLTHROUGH which is lint()able


# ded7008a 19-Aug-2002 Juli Mallett <jmallett@FreeBSD.org>

Enclose IPv6 addresses in brackets when they are displayed printable with a
TCP/UDP port seperated by a colon. This is for the log_in_vain facility.

Pointed out by: Edward J. M. Brocklesby
Reviewed by: ume
MFC after: 2 weeks


# 1fcc99b5 17-Aug-2002 Matthew Dillon <dillon@FreeBSD.org>

Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit

MFC after: 1 week


# c068736a 16-Aug-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Cosmetic-only changes for readability.

Reviewed by: (early form passed by) bde
Approved by: itojun (from core@kame.net)


# fb95b5d3 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Rename mac_check_socket_receive() to mac_check_socket_deliver() so that
we can use the names _receive() and _send() for the receive() and send()
checks. Rename related constants, policy implementations, etc.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# b5addd85 15-Aug-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Reset dupack count in header prediction.
Follow-on to rev 1.39.

Reviewed by: jayanth, Thomas R Henderson <thomas.r.henderson@boeing.com>, silby, dillon


# c488362e 31-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket. Assign labels
to newly accepted connections when the syncache/cookie code has done
its business. Also set peer labels as convenient. Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation. Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 88c39af3 22-Jul-2002 Ruslan Ermilov <ru@FreeBSD.org>

Don't shrink socket buffers in tcp_mss(), application might have already
configured them with setsockopt(SO_*BUF), for RFC1323's scaled windows.

PR: kern/11966
MFC after: 1 week


# d65bf08a 19-Jul-2002 Matthew Dillon <dillon@FreeBSD.org>

Add the tcps_sndrexmitbad statistic, keep track of late acks that caused
unnecessary retransmissions.


# 6fd22caf 24-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Avoid unlocking the inp twice if badport_bandlim() returns -1.

Reported by: jlemon


# f14e4cfe 24-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Style bug: fix 4 space indentations that should have been tabs.

Submitted by: jlemon


# 410bb1bf 23-Jun-2002 Luigi Rizzo <luigi@FreeBSD.org>

Move two global variables to automatic variables within the
only function where they are used (they are used with TCPDEBUG only).


# 2b25acc1 22-Jun-2002 Luigi Rizzo <luigi@FreeBSD.org>

Remove (almost all) global variables that were used to hold
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.

The variables removed by this change are:

ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet

Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().

On passing, remove a bug in divert handling of fragmented packet.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.

Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.

option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.

NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.

* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor stule cleanup will likely be
necessary

* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.

* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).

MFC after: 10 days


# 03e49181 18-Jun-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Remove so*_locked(), which were backed out by mistake.


# f76fcf6d 10-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Lock up inpcb.

Submitted by: Jennifer Yang <yangjihui@yahoo.com>


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# f1320723 01-May-2002 Alfred Perlstein <alfred@FreeBSD.org>

Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.


# 960ed29c 29-Apr-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.

Requested by: bde

Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.

While I am here, sort include files alphabetically, where possible.


# d48d4b25 27-Apr-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.


# 88ff5695 18-Apr-2002 SUZUKI Shinsuke <suz@FreeBSD.org>

just merged cosmetic changes from KAME to ease sync between KAME and FreeBSD.
(based on freebsd4-snap-20020128)

Reviewed by: ume
MFC after: 1 week


# 898568d8 10-Apr-2002 Mike Silbersack <silby@FreeBSD.org>

Remove some ISN generation code which has been unused since the
syncache went in.

MFC after: 3 days


# c1cd65ba 24-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs in the removal of __P(()). Continuation lines
were not outdented to preserve non-KNF lining up of code with parentheses.
Switch to KNF formatting.


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 93ec91ba 27-Feb-2002 Crist J. Clark <cjc@FreeBSD.org>

Change the wording of the inline comments from the previous commit.

Objection from: ru


# 2ca2159f 25-Feb-2002 Crist J. Clark <cjc@FreeBSD.org>

The TCP code did not do sufficient checks on whether incoming packets
were destined for a broadcast IP address. All TCP packets with a
broadcast destination must be ignored. The system only ignored packets
that were _link-layer_ broadcasts or multicast. We need to check the
IP address too since it is quite possible for a broadcast IP address
to come in with a unicast link-layer address.

Note that the check existed prior to CSRG revision 7.35, but was
removed. This commit effectively backs out that nine-year-old change.

PR: misc/35022


# fd8e4ebc 18-Feb-2002 Mike Barcroft <mike@FreeBSD.org>

o Move NTOHL() and associated macros into <sys/param.h>. These are
deprecated in favor of the POSIX-defined lowercase variants.
o Change all occurrences of NTOHL() and associated marcros in the
source tree to use the lowercase function variants.
o Add missing license bits to sparc64's <machine/endian.h>.
Approved by: jake
o Clean up <machine/endian.h> files.
o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>.
o Remove prototypes for non-existent bswapXX() functions.
o Include <machine/endian.h> in <arpa/inet.h> to define the
POSIX-required ntohl() family of functions.
o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>,
and <sys/param.h>.
o Prepend underscores to the ntohl() family to help deal with
complexities associated with having MD (asm and inline) versions, and
having to prevent exposure of these functions in other headers that
happen to make use of endian-specific defines.
o Create weak aliases to the canonical function name to help deal with
third-party software forgetting to include an appropriate header.
o Remove some now unneeded pollution from <sys/types.h>.
o Add missing <arpa/inet.h> includes in userland.

Tested on: alpha, i386
Reviewed by: bde, jake, tmm


# e6658b12 04-Jan-2002 Robert Watson <rwatson@FreeBSD.org>

o Spelling fix in comment: tcp_ouput -> tcp_output


# 0ef3206b 12-Dec-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Fix up tabs in comments.


# 262c1c1a 02-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix a bug with transmitter restart after receiving a 0 window. The
receiver was not sending an immediate ack with delayed acks turned on
when the input buffer is drained, preventing the transmitter from
restarting immediately.

Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and
is a good idea anyway).

Some cleanup. Identify additonal issues in comments.

MFC after: 1 day


# be2ac88c 21-Nov-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs


# d00fd201 21-Nov-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Move initialization of snd_recover into tcp_sendseqinit().


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# f0ffb944 03-Sep-2001 Julian Elischer <julian@FreeBSD.org>

Patches from Keiichi SHIMA <keiichi@iij.ad.jp>
to make ip use the standard protosw structure again.

Obtained from: Well, KAME I guess.


# e7e2b801 29-Aug-2001 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

when newreno is turned on, if dupacks = 1 or dupacks = 2 and
new data is acknowledged, reset the dupacks to 0.
The problem was spotted when a connection had its send buffer full
because the congestion window was only 1 MSS and was not being incremented
because dupacks was not reset to 0.

Obtained from: Yahoo!


# 745bab7f 23-Aug-2001 Dima Dorfman <dd@FreeBSD.org>

Correct a typo in a comment: FIN_WAIT2 -> FIN_WAIT_2

PR: 29970
Submitted by: Joseph Mallett <jmallett@xMach.org>


# b0e3ad75 21-Aug-2001 Mike Silbersack <silby@FreeBSD.org>

Much delayed but now present: RFC 1948 style sequence numbers

In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.

Reviewed by: jesper


# 2d610a50 07-Jul-2001 Mike Silbersack <silby@FreeBSD.org>

Temporary feature: Runtime tuneable tcp initial sequence number
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.

While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.

To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.

Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.

Reviewed by: jlemon
Tested by: numerous subscribers of -net


# c73d99b5 23-Jun-2001 Ruslan Ermilov <ru@FreeBSD.org>

Add netstat(1) knob to reset net.inet.{ip|icmp|tcp|udp|igmp}.stats.
For example, ``netstat -s -p ip -z'' will show and reset IP stats.

PR: bin/17338


# 08517d53 22-Jun-2001 Mike Silbersack <silby@FreeBSD.org>

Eliminate the allocation of a tcp template structure for each
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.

Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 65f28919 06-Jun-2001 Jesper Skriver <jesper@FreeBSD.org>

Silby's take one on increasing FreeBSD's resistance to SYN floods:

One way we can reduce the amount of traffic we send in response to a SYN
flood is to eliminate the RST we send when removing a connection from
the listen queue. Since we are being flooded, we can assume that the
majority of connections in the queue are bogus. Our RST is unwanted
by these hosts, just as our SYN-ACK was. Genuine connection attempts
will result in hosts responding to our SYN-ACK with an ACK packet. We
will automatically return a RST response to their ACK when it gets to us
if the connection has been dropped, so the early RST doesn't serve the
genuine class of connections much. In summary, we can reduce the number
of packets we send by a factor of two without any loss in functionality
by ensuring that RST packets are not sent when dropping a connection
from the listen queue.

Submitted by: Mike Silbersack <silby@silby.com>
Reviewed by: jesper
MFC after: 2 weeks


# e4b64281 29-May-2001 Jesper Skriver <jesper@FreeBSD.org>

Inline TCP_REASS() in the single location where it's used,
just as OpenBSD and NetBSD has done.

No functional difference.

MFC after: 2 weeks


# 853be122 29-May-2001 Jesper Skriver <jesper@FreeBSD.org>

properly delay acks in half-closed TCP connections

PR: 24962
Submitted by: Tony Finch <dot@dotat.at>
MFC after: 2 weeks


# d1745f45 20-Apr-2001 Jesper Skriver <jesper@FreeBSD.org>

Say goodbye to TCP_COMPAT_42

Reviewed by: wollman
Requested by: wollman


# f0a04f3f 17-Apr-2001 Kris Kennaway <kris@FreeBSD.org>

Randomize the TCP initial sequence numbers more thoroughly.

Obtained from: OpenBSD
Reviewed by: jesper, peter, -developers


# c59319bf 19-Mar-2001 Dag-Erling Smørgrav <des@FreeBSD.org>

Axe TCP_RESTRICT_RST. It was never a particularly good idea except for a few
very specific scenarios, and now that we have had net.inet.tcp.blackhole for
quite some time there is really no reason to use it any more.

(last of three commits)


# d8c85a26 25-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Do not delay a new ack if there already is a delayed ack pending on the
connection, but send it immediately. Prior to this change, it was possible
to delay a delayed-ack for multiple times, resulting in degraded TCP
behavior in certain corner cases.


# a57815ef 11-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Clean up RST ratelimiting. Previously, ratelimiting occured before tests
were performed to determine if the received packet should be reset. This
created erroneous ratelimiting and false alarms in some cases. The code
has now been reorganized so that the checks for validity come before
the call to badport_bandlim. Additionally, a few changes in the symbolic
names of the bandlim types have been made, as well as a clarification of
exactly which type each RST case falls under.

Submitted by: Mike Silbersack <silby@silby.com>


# a589a70e 24-Jan-2001 Garrett Wollman <wollman@FreeBSD.org>

Correct a comment.


# 09f81a46 15-Dec-2000 Bosko Milekic <bmilekic@FreeBSD.org>

Change the following:

1. ICMP ECHO and TSTAMP replies are now rate limited.
2. RSTs generated due to packets sent to open and unopen ports
are now limited by seperate counters.
3. Each rate limiting queue now has its own description, as
follows:

Limiting icmp unreach response from 439 to 200 packets per second
Limiting closed port RST response from 283 to 200 packets per second
Limiting open port RST response from 18724 to 200 packets per second
Limiting icmp ping response from 211 to 200 packets per second
Limiting icmp tstamp response from 394 to 200 packets per second

Submitted by: Mike Silbersack <silby@silby.com>


# 7cc0979f 08-Dec-2000 David Malone <dwmalone@FreeBSD.org>

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 8735719e 04-Nov-2000 Jonathan Lemon <jlemon@FreeBSD.org>

tp->snd_recover is part of the New Reno recovery algorithm, and should
only be checked if the system is currently performing New Reno style
fast recovery. However, this value was being checked regardless of the
NR state, with the end result being that the congestion window was never
opened.

Change the logic to check t_dupack instead; the only code path that
allows it to be nonzero at this point is NewReno, so if it is nonzero,
we are in fast recovery mode and should not touch the congestion window.

Tested by: phk


# e7f32693 21-Jul-2000 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

When a connection is being dropped due to a listen queue overflow,
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.

Reviewed by: Peter Wemm,asmodai


# b474779f 09-Jul-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

be more cautious about tcp option length field. drop bogus ones earlier.
not sure if there is a real threat or not, but it seems that there's
possibility for overrun/underrun (like non-NOP option with optlen > cnt).


# 686cdd19 04-Jul-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 4f14ee00 22-May-2000 Dan Moschuk <dan@FreeBSD.org>

sysctl'ize ICMP_BANDLIM and ICMP_BANDLIM_SUPPRESS_OUTPUT.

Suggested by: des/nbm


# d8417274 18-May-2000 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

snd_cwnd was updated twice in the tcp_newreno function.


# 75c6e0e2 17-May-2000 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

Sigh, fix a rookie patch merge error.

Also-missed-by: peter


# 6b2a5f92 15-May-2000 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

snd_una was being updated incorrectly, this resulted in the newreno
code retransmitting data from the wrong offset.

As a footnote, the newreno code was partially derived from NetBSD
and Tom Henderson <tomh@cs.berkeley.edu>


# 46f58482 05-May-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Implement TCP NewReno, as documented in RFC 2582. This allows
better recovery for multiple packet losses in a single window.
The algorithm can be toggled via the sysctl net.inet.tcp.newreno,
which defaults to "on".

Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>


# 5e0ab69d 17-Apr-2000 Munechika SUMIKAWA <sumikawa@FreeBSD.org>

ND6_HINT() should not be called unless the connection status is
ESTABLISHED.

Obtained from: KAME Project


# fdaf052e 01-Apr-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Support per socket based IPv4 mapped IPv6 addr enable/disable control.

Submitted by: ume


# db4f9cc7 27-Mar-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


# 4739b807 11-Mar-2000 Yoshinobu Inoue <shin@FreeBSD.org>

IPv6 6to4 support.

Now most big problem of IPv6 is getting IPv6 address
assignment.
6to4 solve the problem. 6to4 addr is defined like below,

2002: 4byte v4 addr : 2byte SLA ID : 8byte interface ID

The most important point of the address format is that an IPv4 addr
is embeded in it. So any user who has IPv4 addr can get IPv6 address
block with 2byte subnet space. Also, the IPv4 addr is used for
semi-automatic IPv6 over IPv4 tunneling.

With 6to4, getting IPv6 addr become dramatically easy.
The attached patch enable 6to4 extension, and confirmed to work,
between "Richard Seaman, Jr." <dick@tar.com> and me.

Approved by: jkh

Reviewed by: itojun


# 173c0f9f 27-Jan-2000 Warner Losh <imp@FreeBSD.org>

Mitigate the stream.c attacks

o Drop all broadcast and multicast source addresses in tcp_input.
o Enable ICMP_BANDLIM in GENERIC.
o Change default to 200/s from 100/s. This will still stop the attack, but
is conservative enough to do this close to code freeze.

This is not the optimal patch for the problem, but is likely the least
intrusive patch that can be made for this.

Obtained from: Don Lewis and Matt Dillon.
Reviewed by: freebsd-security


# 69a34685 24-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Avoid m_len and m_pkthdr.len inconsistency when changing m_len
for an mbuf whose M_PKTHDR is set.

PR: related to kern/15175
Reviewed by: archie


# 3a2a9f79 15-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Fixed the problem that IPsec connection hangs when bigger data is sent.
-opt_ipsec.h was missing on some tcp files (sorry for basic mistake)
-made buildable as above fix
-also added some missing IPv4 mapped IPv6 addr consideration into
ipsec4_getpolicybysock


# 8972cdb1 12-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

add a comment for some possible? IPv4 option processing.


# fb59c426 09-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 6a800098 22-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# c0f7fd55 14-Dec-1999 Jonathan Lemon <jlemon@FreeBSD.org>

Use SEQ_* macros for comparing sequence space numbers.

Reviewed by: truckman


# 1a244a61 10-Dec-1999 Jonathan Lemon <jlemon@FreeBSD.org>

According to RFC 793, a reset should be honored if the sequence number
is within the receive window. Follow this behavior, instead of only
allowing resets at last_ack_sent.

Pointed out by: jayanth@yahoo-inc.com


# cfa1ca9d 07-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translater daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# ecf72308 09-Oct-1999 Brian Feldman <green@FreeBSD.org>

Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.


# f8613305 14-Sep-1999 Dag-Erling Smørgrav <des@FreeBSD.org>

Fix some more disordering, as well as the description string for the
net.inet.tcp.drop_synfin sysctl, which for some mysterious reason said
"Drop TCP packets with FIN+ACK set" (instead of "...with SYN+FIN set")


# e46cd3d4 12-Sep-1999 Dag-Erling Smørgrav <des@FreeBSD.org>

Add the net.inet.tcp.restrict_rst and net.inet.tcp.drop_synfin sysctl
variables, conditional on the TCP_RESTRICT_RST and TCP_DROP_SYNFIN kernel
options, respectively. See the comments in LINT for details.


# 9b8b58e0 30-Aug-1999 Jonathan Lemon <jlemon@FreeBSD.org>

Restructure TCP timeout handling:

- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.

Submitted by: jlemon, wollmann


# 5a8c77a8 29-Aug-1999 David E. O'Brien <obrien@FreeBSD.org>

Remove extra indenting of `break' statements introducted in rev 1.89,
plus wrap some long lines from that revision.

While here, wrap some other long lines.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 828b7f40 18-Aug-1999 Geoff Rehmet <csgr@FreeBSD.org>

Fix breakage if blackhole=1 and tiflags & TH_SYN, plus
style(9) fixes

Submitted by: Jonathon Lemon


# 2e4e1b4c 18-Aug-1999 Geoff Rehmet <csgr@FreeBSD.org>

Slight tweak to tcp.blackhole to add optional behaviour to
drop any segment arriving at a closed port.
tcp.blackhole=1 - only drop SYN without RST
tcp.blackhole=2 - drop everything without RST
tcp.blackhole=0 - always send RST - default behaviour

This confuses nmap -sF or -sX or -sN quite badly.


# 16f7f31f 16-Aug-1999 Geoff Rehmet <csgr@FreeBSD.org>

Add net.inet.tcp.blackhole and net.inet.udp.blackhole
sysctl knobs.

With these knobs on, refused connection attempts are dropped
without sending a RST, or Port unreachable in the UDP case.
In the TCP case, sending of RST is inhibited iff the incoming
segment was a SYN.

Docs and rc.conf settings to follow.


# e9bd3a37 18-Jul-1999 Jonathan M. Bresler <jmb@FreeBSD.org>

fix comment re: RST received in TIME_WAIT to match the code.


# dfd5dee1 06-May-1999 Peter Wemm <peter@FreeBSD.org>

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


# 3d177f46 03-May-1999 Bill Fumerola <billf@FreeBSD.org>

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 51b7b337 05-Feb-1999 Bill Fenner <fenner@FreeBSD.org>

Use snd_nxt, not rcv_nxt, when calculating the ISS during TIME_WAIT.
This was missed in the 4.4-Lite2 merge.

Noticed by: Mohan Parthasarathy <Mohan.Parthasarathy@eng.Sun.COM> and
jayanth@loc201.tandem.com (vijayaraghavan_jayanth)
on the tcp-impl mailing list.


# 831a80b0 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 51508de1 03-Dec-1998 Matthew Dillon <dillon@FreeBSD.org>

Reviewed by: freebsd-current

Add ICMP_BANDLIM option and 'net.inet.icmp.icmplim' sysctl. If option
is specified in kernel config, icmplim defaults to 100 pps. Setting it
to 0 will disable the feature. This feature limits ICMP error responses
for packets sent to bad tcp or udp ports, which does a lot to help the
machine handle network D.O.S. attacks.

The kernel will report packet rates that exceed the limit at a rate of
one kernel printf per second. There is one issue in regards to the
'tail end' of an attack... the kernel will not output the last report
until some unrelated and valid icmp error packet is return at some
point after the attack is over. This is a minor reporting issue only.


# 80ab7c0e 11-Sep-1998 Garrett Wollman <wollman@FreeBSD.org>

Fix RST validation.

PR: 7892
Submitted by: Don.Lewis@tsc.tdk.com


# 6effc713 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Re-implement tcp and ip fragment reassembly to not store pointers in the
ip header which can't work on alpha since pointers are too big.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


# f9e354df 05-Jul-1998 Julian Elischer <julian@FreeBSD.org>

Support for IPFW based transparent forwarding.
Any packet that can be matched by a ipfw rule can be redirected
transparently to another port or machine. Redirection to another port
mostly makes sense with tcp, where a session can be set up
between a proxy and an unsuspecting client. Redirection to another machine
requires that the other machine also be expecting to receive the forwarded
packets, as their headers will not have been modified.

/sbin/ipfw must be recompiled!!!

Reviewed by: Peter Wemm <peter@freebsd.org>
Submitted by: Chrisy Luke <chrisy@flix.net>


# 04a3fd12 31-May-1998 Peter Wemm <peter@FreeBSD.org>

Let the sowwakeup macro decide when to call sowakeup rather than have
tcp "know" about it. A pending upcall would be missed, eg: used by NFS.

Obtained from: NetBSD


# 068373b6 18-May-1998 Guido van Rooij <guido@FreeBSD.org>

Grumble...It seems I'm suffering from some mental disease. Do it correct now.


# 0bce271a 18-May-1998 Guido van Rooij <guido@FreeBSD.org>

Add some parenthesis for clarity and fix a bug
Pointed out by: Garrett Wollmand


# 11ad4550 04-May-1998 Guido van Rooij <guido@FreeBSD.org>

Refuse accellerated opens on listening sockets that have not set
the TCP_NOPUSH socket option.
This disables TAO for those services that do not know about T/TCP.

Reviewed by: Garrett Wollman
Submitted by: Peter Wemm


# 84e33c9e 24-Apr-1998 David Greenman <dg@FreeBSD.org>

At the request of Garrett, changed sysctl:

net.inet.tcp.delack_enabled -> net.inet.tcp.delayed_ack


# dc733423 17-Apr-1998 Dag-Erling Smørgrav <des@FreeBSD.org>

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


# 8e5db87c 06-Apr-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the last traces of TUBA.

Inspired by: PR kern/3317


# 75daa6a5 19-Mar-1998 Bill Fenner <fenner@FreeBSD.org>

Remove the check for SYN in SYN_RECEIVED state; it breaks simultaneous
connect. This check was added as part of the defense against the "land"
attack, to prevent attacks which guess the ISS from going into ESTABLISHED.
The "src == dst" check will still prevent the single-homed case of the
"land" attack, and guessing ISS's should be hard anyway.

Submitted by: David Borman <dab@bsdi.com>


# f498eeee 25-Feb-1998 David Greenman <dg@FreeBSD.org>

Changes to support the addition of a new sysctl variable:
net.inet.tcp.delack_enabled
Which defaults to 1 and can be set to 0 to disable TCP delayed-ack
processing (i.e. all acks are immediate).


# c3229e05 27-Jan-1998 David Greenman <dg@FreeBSD.org>

Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).


# 764d8cef 20-Jan-1998 Bill Fenner <fenner@FreeBSD.org>

A more complete fix for the "land" attack, removing the "quick fix" from
rev 1.66. This fix contains both belt and suspenders.

Belt: ignore packets where src == dst and srcport == dstport in TCPS_LISTEN.
These packets can only legitimately occur when connecting a socket to itself,
which doesn't go through TCPS_LISTEN (it goes CLOSED->SYN_SENT->SYN_RCVD->
ESTABLISHED). This prevents the "standard" "land" attack, although doesn't
prevent the multi-homed variation.

Suspenders: send a RST in response to a SYN/ACK in SYN_RECEIVED state.
The only packets we should get in SYN_RECEIVED are
1. A retransmitted SYN, or
2. An ack of our SYN/ACK.
The "land" attack depends on us accepting our own SYN/ACK as an ACK;
in SYN_RECEIVED state; this should prevent all "land" attacks.

We also move up the sequence number check for the ACK in SYN_RECEIVED.
This neither helps nor hurts with respect to the "land" attack, but
puts more of the validation checking in one spot.

PR: kern/5103


# 592071e8 19-Dec-1997 Bruce Evans <bde@FreeBSD.org>

Don't use ANSI string concatenation to misformat a string.


# 76d3eadb 20-Nov-1997 Garrett Wollman <wollman@FreeBSD.org>

Add Matt Dillon's quick fix hack for the self-connect DoS.

PR: 5103


# 4a11ca4e 07-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


# 55b211e3 28-Oct-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 4281faf2 01-Oct-1997 David Greenman <dg@FreeBSD.org>

Killed the SYN_RECEIVED addition from rev 1.52. It results in legitimate
RST's being ignored, keeping a connection around until it times out, and
thus has the opposite effect of what was intended (which is to make the
system more robust to DoS attacks).


# 026650e5 30-Sep-1997 Bill Fenner <fenner@FreeBSD.org>

Don't consider a SYN/ACK with CC but no CCECHO a proper T/TCP
handshake.

Reviewed by: Rich Stevens <rstevens@kohala.com>


# 0cc12cc5 16-Sep-1997 Joerg Wunsch <joerg@FreeBSD.org>

Make TCPDEBUG a new-style option.


# 57bf258e 16-Aug-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 66e39adc 30-Jun-1997 John Polstra <jdp@FreeBSD.org>

Fix a bug (apparently very old) that can cause a TCP connection to
be dropped when it has an unusual traffic pattern. For full details
as well as a test case that demonstrates the failure, see the
referenced PR.

Under certain circumstances involving the persist state, it is
possible for the receive side's tp->rcv_nxt to advance beyond its
tp->rcv_adv. This causes (tp->rcv_adv - tp->rcv_nxt) to become
negative. However, in the code affected by this fix, that difference
was interpreted as an unsigned number by max(). Since it was
negative, it was taken as a huge unsigned number. The effect was
to cause the receiver to believe that its receive window had negative
size, thereby rejecting all received segments including ACKs. As
the test case shows, this led to fruitless retransmissions and
eventually to a dropped connection. Even connections using the
loopback interface could be dropped. The fix substitutes the signed
imax() for the unsigned max() function.

PR: closes kern/3998
Reviewed by: davidg, fenner, wollman


# a29f300e 27-Apr-1997 Garrett Wollman <wollman@FreeBSD.org>

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 39172c94 10-Nov-1996 Bill Fenner <fenner@FreeBSD.org>

Re-enable the TCP SYN-attack protection code. I was the one who didn't
understand the socket state flag.

2.2 candidate.


# a51764a8 11-Oct-1996 Paul Traina <pst@FreeBSD.org>

Fix two bugs I accidently put into the syn code at the last minute
(yes I had tested the hell out of this).

I've also temporarily disabled the code so that it behaves as it previously
did (tail drop's the syns) pending discussion with fenner about some socket
state flags that I don't fully understand.

Submitted by: fenner


# 6d6a026b 07-Oct-1996 David Greenman <dg@FreeBSD.org>

Improved in_pcblookuphash() to support wildcarding, and changed relavent
callers of it to take advantage of this. This reduces new connection
request overhead in the face of a large number of PCBs in the system.
Thanks to David Filo <filo@yahoo.com> for suggesting this and providing
a sample implementation (which wasn't used, but showed that it could be
done).

Reviewed by: wollman


# ebb0cbea 06-Oct-1996 Paul Traina <pst@FreeBSD.org>

Increase robustness of FreeBSD against high-rate connection attempt
denial of service attacks.

Reviewed by: bde,wollman,olah
Inspired by: vjs@sgi.com


# 12eafeb0 21-Sep-1996 Paul Traina <pst@FreeBSD.org>

I don't understand, I committed this fix (move a counter and fixed a typo)
this evening.

I think I'm going insane.


# 96d719e6 21-Sep-1996 Andrey A. Chernov <ache@FreeBSD.org>

Syntax error: so_incom -> so_incomp


# 4195b4af 20-Sep-1996 Paul Traina <pst@FreeBSD.org>

If the incomplete listen queue for a given socket is full,
drop the oldest entry in the queue.

There was a fair bit of discussion as to whether or not the
proper action is to drop a random entry in the queue. It's
my conclusion that a random drop is better than a head drop,
however profiling this section of code (done by John Capo)
shows that a head-drop results in a significant performance
increase.

There are scenarios where a random drop is more appropriate.
If I find one in reality, I'll add the random drop code under
a conditional.

Obtained from: discussions and code done by Vernon Schryver (vjs@sgi.com).


# 7b40aa32 13-Sep-1996 Paul Traina <pst@FreeBSD.org>

Make the misnamed tcp initial keepalive timer value (which is really the
time, in seconds, that state for non-established TCP sessions stays about)
a sysctl modifyable variable.

[part 1 of two commits, I just realized I can't play with the indices as
I was typing this commit message.]


# 7ff19458 13-Sep-1996 Paul Traina <pst@FreeBSD.org>

Receipt of two SYN's are sufficient to set the t_timer[TCPT_KEEP]
to "keepidle". this should not occur unless the connection has
been established via the 3-way handshake which requires an ACK

Submitted by: jmb
Obtained from: problem discussed in Stevens vol. 3


# df5c0b8a 01-May-1996 Bill Fenner <fenner@FreeBSD.org>

Back out my stupid braino; I was thinking strlen and not sizeof.


# af00f800 01-May-1996 Bill Fenner <fenner@FreeBSD.org>

Size temp var correctly; buf[4*sizeof "123"] is not long enough
to store "192.252.119.189\0".


# 75cfc95f 27-Apr-1996 Andrey A. Chernov <ache@FreeBSD.org>

inet_ntoa buffer was evaluated twice in log_in_vain, fix it.
Thanx to: jdp


# a2352fc1 26-Apr-1996 Garrett Wollman <wollman@FreeBSD.org>

Delete #ifdef notdef blocks containing old method of srtt calculation.

Requested by: davidg


# d78a37ad 09-Apr-1996 Paul Traina <pst@FreeBSD.org>

Logging UDP and TCP connection attempts should not be enabled by default.
It's trivial to create a denial of service attack on a box so enabled.

These messages, if enabled at all, must be rate-limited. (!)


# 816a3d83 04-Apr-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Log TCP syn packets for ports we don't listen on.
Controlled by: sysctl net.inet.tcp.log_in_vain: 1

Log UDP syn packets for ports we don't listen on.
Controlled by: sysctl net.inet.udp.log_in_vain: 1

Suggested by: Warren Toomey <wkt@cs.adfa.oz.au>


# 9e2874b0 25-Mar-1996 Garrett Wollman <wollman@FreeBSD.org>

Slight modification of RTO floor calculation.


# 233e8c18 22-Mar-1996 Garrett Wollman <wollman@FreeBSD.org>

A number of performance-reducing flaws fixed based on comments
from Larry Peterson &co. at Arizona:

- Header prediction for ACKs did not exclude Fast Retransmit/Recovery.
- srtt calculation tended to get ``stuck'' and could never decrease
when below 8. It still can't, but the scaling factors are adjusted
so that this artifact does not cause as bad an effect on the RTO
value as it used to.

The paper also points out the incr/8 error that has been long since fixed,
and the problems with ACKing frequency resulting from the use of options
which I suspect to be fixed already as well (as part of the T/TCP work).

Obtained from: Brakmo & Peterson, ``Performance Problems in BSD4.4 TCP''


# 2ee45d7d 11-Mar-1996 David Greenman <dg@FreeBSD.org>

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# 1347f5b8 26-Feb-1996 Guido van Rooij <guido@FreeBSD.org>

Add a counter for the number of times the listen queue was overflowed to
the tcpstat structure. (netstat -s)
Reviewed by: wollman
Obtained from: Steves, TCP/IP Ill. vol.3, page 189


# f9d5a964 22-Feb-1996 David Greenman <dg@FreeBSD.org>

Fixed bug in Path MTU Discovery that caused the system to have to re-
discover the Path MTU for each connection if the connecting host didn't
offer an initial MSS.

Submitted by: davidg & olah


# 07e43e10 31-Jan-1996 Andras Olah <olah@FreeBSD.org>

Fix a bug related to the interworking of T/TCP and window scaling:
when a connection enters the ESTBLS state using T/TCP, then window
scaling wasn't properly handled. The fix is twofold.

1) When the 3WHS completes, make sure that we update our window
scaling state variables.

2) When setting the `virtual advertized window', then make sure
that we do not try to offer a window that is larger than the maximum
window without scaling (TCP_MAXWIN).

Reviewed by: davidg
Reported by: Jerry Chen <chen@Ipsilon.COM>


# f708ef1b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Another mega commit to staticize things.


# 0312fbe9 14-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

New style sysctl & staticize alot of stuff.


# 98163b98 09-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Start adding new style sysctl here too.


# 845799c1 03-Nov-1995 Andras Olah <olah@FreeBSD.org>

Cosmetic changes to processing of segments in the SYN_SENT state:
- remove a redundant condition;
- complete all validity checks on segment before calling
soisconnected(so).

Reviewed by: Richard Stevens, davidg, wollman


# 91badc86 13-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

Routes can be asymmetric. Always offer to /accept/ an MSS of up to the
capacity of the link, even if the route's MTU indicates that we cannot
send that much in their direction. (This might actually make it possible
to test Path MTU discovery in a useful variety of cases.)


# e79adb8e 03-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

Finish 4.4-Lite-2 merge: randomize TCP initial sequence numbers
to make ISS-guessing spoofing attacks harder.


# efe4b0eb 21-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Second try: get 4.4-Lite-2 into the source tree. The conflicts don't
matter because none of our working source files are on the CSRG branch
any more.

Obtained from: 4.4BSD-Lite-2


# d3eede9d 31-Jul-1995 Andras Olah <olah@FreeBSD.org>

Remove a redundant `if' from tcp_reass().

Correct a typo in a comment (SEND_SYN -> NEEDSYN).

Reviewed by: David Greenman


# dd224982 10-Jul-1995 Garrett Wollman <wollman@FreeBSD.org>

tcp_input.c - keep track of how many times a route contained a cached rtt
or ssthresh that we were able to use

tcp_var.h - declare tcpstat entries for above; declare tcp_{send,recv}space

in_rmx.c - fill in the MTU and pipe sizes with the defaults TCP would have
used anyway in the absence of values here


# fc978271 29-Jun-1995 Garrett Wollman <wollman@FreeBSD.org>

Keep track of the number of samples through the srtt filter so that we
know better when to cache values in the route, rather than relying on a
heuristic involving sequence numbers that broke when tcp_sendspace
was increased to 16k.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 6b067b07 10-May-1995 David Greenman <dg@FreeBSD.org>

#ifdef'd my Nagel/ACK hack with "TCP_ACK_HACK", disabled by default. I'm
currently considering reducing the TCP fasttimo to 100ms to help improve
things, but this would be done as a seperate step at some point in the
future.
This was done because it was causing some sometimes serious performance
problems with T/TCP.


# 40db8ef7 08-May-1995 Andras Olah <olah@FreeBSD.org>

Fix a misspelled constant in tcp_input.c.

On Tue, 09 May 1995 04:35:27 PDT, Richard Stevens wrote:
> In tcp_dooptions() under the case TCPOPT_CC there is an assignment
>
> to->to_flag |= TCPOPT_CC;
>
> that should be
>
> to->to_flag |= TOF_CC;
>
> I haven't thought through the ramifications of what's been happening ...
>
> Rich Stevens

Submitted by: rstevens@noao.edu (Richard Stevens)


# 0d7b7d3e 03-May-1995 David Greenman <dg@FreeBSD.org>

Changed in_pcblookuphash() to not automatically call in_pcblookup() if
the lookup fails. Updated callers to deal with this. Call in_pcblookuphash
instead of in_pcblookup() in in_pcbconnect; this improves performance of
UDP output by about 17% in the standard case.


# d79940da 10-Apr-1995 David Greenman <dg@FreeBSD.org>

Further satisfy my paranoia by making sure that the ACKNOW is only
set when ti_len is non-zero.


# afa70c96 10-Apr-1995 David Greenman <dg@FreeBSD.org>

Fixed bug I introduced with my Nagel hack which caused tcp_input and
tcp_output to loop endlessly. This was freefall's problem during the past
day.


# 15bd2b43 08-Apr-1995 David Greenman <dg@FreeBSD.org>

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# 755c1f07 05-Apr-1995 Andras Olah <olah@FreeBSD.org>

Fix a bug in tcp_input reported by Rick Jones <raj@hpisrdq.cup.hp.com>.

If a goto findpcb occurred during the processing of a segment, the TCP and
IP headers were dropped twice from the mbuf which resulted in data acked
by TCP but not delivered to the user.
Reviewed by: davidg


# e612a582 27-Mar-1995 David Greenman <dg@FreeBSD.org>

Re-apply my "breakage" to the Nagel congestion avoidence. This version
differs slightly in the logic from the previous version; packets are now
acked immediately if the sender set PUSH.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# dac20301 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Avoid deadlock situation described by Stevens using his suggested replacement
code.

Obtained from: Stevens, vol. 2, pp. 959-960


# 41f82abe 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Transaction TCP support now standard. Hack away!


# c70f4510 13-Feb-1995 Poul-Henning Kamp <phk@FreeBSD.org>

YFfix.


# 2f96f1f4 13-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Get rid of some unneeded #ifdef TTCP lines. Also, get rid of some
bogus commons declared in header files.


# a0292f23 09-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and
Bob Braden <braden@isi.edu>.

NB: This has not had David's TCP ACK hack re-integrated. It is not clear
what the correct solution to this problem is, if any. If a better solution
doesn't pop up in response to this message, I'll put David's code back in
(or he's welcome to do so himself).


# 10be5648 13-Oct-1994 Garrett Wollman <wollman@FreeBSD.org>

As suggested by Sally Floyd, don't add the ``small fraction of the window
size'' when doing congestion avoidance.

Submitted by: Mark Andrews


# 623ae52e 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

GCC cleanup.
Reviewed by:
Submitted by:
Obtained from:


# 610ee2f9 15-Sep-1994 David Greenman <dg@FreeBSD.org>

Made TCPDEBUG truely optional. Based on changes I made in FreeBSD 1.1.5.
Fixed somebody's idea of a joke - about the first half of the lines in
in_proto.c were spaced over by one space.


# 6a7be6e8 26-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Obey RFC 793, section 3.4:

Several examples of connection initiation follow. Although these
examples do not show connection synchronization using data-carrying
segments, this is perfectly legitimate, so long as the receiving TCP
doesn't deliver the data to the user until it is clear the data is
valid (i.e., the data must be buffered at the receiver until the
connection reaches the ESTABLISHED state).


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# b164106f 31-Jul-1994 David Greenman <dg@FreeBSD.org>

Fixed bug with Nagel Congestion Avoidance where a tcp connection would
stall unnecessarily - always send an ACK when a packet len of < mss is
received.


# d4d0967e 26-May-1994 David Greenman <dg@FreeBSD.org>

Added missing ntohl()'s that are needed before calling IN_MULTICAST in
a couple of places.
Submitted by: Johannes Helander


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources