History log of /freebsd-current/sys/netinet/tcp_usrreq.c
Revision	Date	Author	Comments
# e7381521	30-May-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp: remove unused code in tcp_usr_attach pr_attach is only called on a socket (so) with so->so_listen != NULL via sonewconn. However, sonewconn is not called from the TCP code. The listening sockets are handled in tcp_syncache.c without using sonewconn. Therefore, the code removed is never executed. No functional change intended. Reviewed by: rrs, peter.lei_ieee.org MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D45412
# fe136aec	23-May-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp: improve inp locking in setsockopt Ensure that the inp is not dropped when starting a stack switch. While there, clean-up the code by using INP_WLOCK_RECHECK, which also re-assigns tp. Reviewed by: glebius MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D45241
# fce03f85	05-May-2024	Randall Stewart <rrs@FreeBSD.org>	TCP can be subject to Sack Attacks lets fix this issue. There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like: ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704) This first sack looks fine but then the attacker sends ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705) ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707) ... These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries. To combat this we introduce three things. when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing. Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks. 
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway. The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks). After this set of changes is in rack can drop its SAD detection completely Reviewed by:tuexen@, rscheff@ Differential Revision: <https://reviews.freebsd.org/D44903>
# dd7b86e2	18-Mar-2024	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove IS_FASTOPEN() macro The macro is more obfuscating than helping as it just checks a single flag of t_flags. All other t_flags bits are checked without a macro. A bigger problem was that declaration of the macro in tcp_var.h depended on a kernel option. It is a bad practice to create such definitions in installable headers. Reviewed by: rscheff, tuexen, kib Differential Revision: https://reviews.freebsd.org/D44362
# 85df11a1	12-Mar-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	ktls: deep copy tls_enable struct for in-kernel tcp consumers Doing a deep copy of the keys early allows users of the tls_enable structure to assume kernel memory. This enables the socket options to be set by kernel threads. Reviewed By: #transport, tuexen, jhb, rrs Sponsored by: NetApp, Inc. X-NetApp-PR: #79 Differential Revision: https://reviews.freebsd.org/D44250
# e18b97bd	12-Mar-2024	Randall Stewart <rrs@FreeBSD.org>	Update to bring the rack stack with all its fixes in. This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output. It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements. With a bit of a struggle with git I finally got rack_pcm.c into place (apologies for not noticing this error). The LINT kernel is running on my box now .. sigh. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
# c112243f	11-Mar-2024	Brooks Davis <brooks@FreeBSD.org>	Revert "Update to bring the rack stack with all its fixes in." This commit was incomplete and breaks LINT kernels. The tree has been broken for 8+ hours. This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.
# f6d489f4	11-Mar-2024	Randall Stewart <rrs@FreeBSD.org>	Update to bring the rack stack with all its fixes in. This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output. Note there is a new file that I can't figure out how to get in rack_pcm.c It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
# abe8379b	15-Feb-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: repair wakeup of accept(2) by shutdown(2) That was lost in transition from one-for-all soshutdown() to protocol specific methods. Only protocols that listen(2) were affected. This is not a documented or specified feature, but some software relies on it. At least the FreeSWITCH telephony software uses this behavior on PF_INET/SOCK_STREAM. Fixes: 5bba2728079ed4da33f727dbc2b6ae1de02ba897
# 3eeb22cb	10-Feb-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: clean scoreboard when releasing the socket buffer The SACK scoreboard is conceptually an extention of the socket buffer. Remove it when the socket buffer goes away with soisdisconnected(). Verify that this is also the expected state in tcp_discardcb(). PR: 276761 Reviewed by: glebius, tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43805
# ce69e373	03-Feb-2024	Gleb Smirnoff <glebius@FreeBSD.org>	Revert "sockets: retire sorflush()" Provide a comment in sorflush() why the socket I/O sx(9) lock is actually important. This reverts commit 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3.
# 507f87a7	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: retire sorflush() With removal of dom_dispose method the function boils down to two meaningful function calls: socantrcvmore() and sbrelease(). The latter is only relevant for protocols that use generic socket buffers. The socket I/O sx(9) lock acquisition in sorflush() is not relevant for shutdown(2) operation as it doesn't do any I/O that may interleave with read(2) or write(2). The socket buffer mutex acquisition inside sbrelease() is what guarantees thread safety. This sx(9) acquisition in soshutdown() can be tracked down to 4.4BSD times, where it used to be sblock(), and it was carried over through the years evolving together with sockets with no reconsideration of why do we carry it over. I can't tell if that sblock() made sense back then, but it doesn't make any today. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43415
# 5bba2728	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: make pr_shutdown fully protocol specific method Disassemble a one-for-all soshutdown() into protocol specific methods. This creates a small amount of copy & paste, but makes code a lot more self documented, as protocol specific method would execute only the code that is relevant to that protocol and nothing else. This also fixes a couple recent regressions and reduces risk of future regressions. The extended KPI for the new pr_shutdown removes need for the extra pr_flush which was added for the sake of SCTP which could not perform its shutdown properly with the old one. Particularly for SCTP this change streamlines a lot of code. Some notes on why certain parts of code were copied or were not to certain protocols: * The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is needed only for those protocols that may be connected or disconnected. * The above reduces into only SS_ISCONNECTED for those protocols that always connect instantly. * The ENOTCONN and continue processing hack is left only for datagram protocols. * The SOLISTENING(so) block is copied to those protocols that listen(2). * sorflush() on SHUT_RD is copied almost to every protocol, but that will be refactored later. * wakeup(&so->so_timeo) is copied to protocols that can make a non-instant connect(2), can SO_LINGER or can accept(2). There are three protocols (netgraph(4), Bluetooth, SDP) that did not have pr_shutdown, but old soshutdown() would still perform sorflush() on SHUT_RD for them and also wakeup(9). Those protocols partially supported shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay about that and SDP is almost abandoned anyway. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43413
# a13039e2	27-Dec-2023	Gleb Smirnoff <glebius@FreeBSD.org>	inpcb: reoder inpcb destruction First, merge in_pcbdetach() with in_pcbfree(). The comment for in_pcbdetach() was no longer correct. Then, make sure we remove the inpcb from the hash before we commit any destructive actions on it. There are couple functions that rely on the hash lock skipping SMR + inpcb lock to lookup an inpcb. Although there are no known functions that similarly rely on the global inpcb list lock, also do list removal before destructive actions. PR: 273890 Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D43122
# d2ef52ef	04-Dec-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp/hpts: make stacks responsible for clearing themselves out HPTS There already is the tfb_tcp_timer_stop_all method that is supposed to stop all time events associated with a given tcpcb by given stack. Some time ago it was doing actual callout_stop(). Today bbr/rack just mark their internal state as inactive in their tfb_tcp_timer_stop_all methods, but tcpcb stays in HPTS wheel and potentially called in from HPTS. Change the methods to also call tcp_hpts_remove(). Note: I'm not sure if internal flag is still relevant once we are out of HPTS wheel. Call the method when connection goes into TCP_CLOSED state, instead of calling it later when tcpcb is freed. Also call it when we switch between stacks. Reviewed by: tuexen, rrs Differential Revision: https://reviews.freebsd.org/D42857
# f42518ff	30-Nov-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: for LRD move sysctl from tcp.do_lrd tp tcp.sack.lrd, remove sockopt Moving lrd sysctl to the tcp.sack branch, since LRD only works with SACK. Remove the sockopt to programmatically control LRD per session. Reviewed By: #transport, tuexen, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D42851
# cfb1e929	30-Nov-2023	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: don't malloc/free sockaddr memory on accept(2) Let the accept functions provide stack memory for protocols to fill it in. Generic code should provide sockaddr_storage, specialized code may provide smaller structure. While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting required length in case if provided length was insufficient. Our manual page accept(2) and POSIX don't explicitly require that, but one can read the text as they do. Linux also does that. Update tests accordingly. Reviewed by: rscheff, tuexen, zlei, dchagin Differential Revision: https://reviews.freebsd.org/D42635
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 70e30add	16-Nov-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove extraneous network epoch entry accept(2) on IPv6 TCP doesn't need epoch. Some leaf functions may need it, but they will enter accordingly, see sa6_recoverscope(). Reviewed by: rscheff, tuexen (implicitly, see deleted XXXMT) Differential Revision: https://reviews.freebsd.org/D42634
# dc485b96	22-Aug-2023	Marius Strobl <marius@FreeBSD.org>	tcp_info: Add and export more FreeBSD-specific fields This change adds struct tcp_info fields corresponding to the following struct tcpcb ones: - snd_una - snd_max - rcv_numsacks - rcv_adv - dupacks Note that while both tcp_fill_info() and fill_tcp_info_from_tcb() are extended accordingly, no counterpart of rcv_numsacks is available in the cxgbe(4) TOE PCB, though. Sponsored by: NetApp, Inc. (originally)
# 8c6104c4	22-Aug-2023	Marius Strobl <marius@FreeBSD.org>	tcp_fill_info(): Change lock assertion on INPCB to locked only This function actually only ever reads from the TCP PCB. Consequently, also make the pointer to its TCP PCB parameter const. Sponsored by: NetApp, Inc. (originally)
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
# de0a2eb2	23-Jun-2023	Mark Johnston <markj@FreeBSD.org>	tcp: Disallow connecting a disconnected socket Currently nothing prevents tcp_usr_connect() from attempting to connect when the socket has been disconnected. At the moment, doing so triggers an assertion in in_pcbconnect() because inp_faddr is not unspecified. I believe this may have been caught in the past by TIMEWAIT checks, but those are now removed. Check for additional socket states in tcp_connect(). Reported by: syzbot+f0f7871ec5397602b446@syzkaller.appspotmail.com Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D40579
# 04682968	20-Jun-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: expose AccECN mode and TCP FastOpen (TFO) in TCPI Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D40621
# c2399dd2	06-May-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp: improve BBLoging for PRUs Log all errors for PRUs, except when INP_DROPPED is set. In that case, don't log it. Reviewed by: glebius, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D39591
# c2a69e84	25-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_hpts: move HPTS related fields from inpcb to tcpcb This makes inpcb lighter and allows future cache line optimizations of tcpcb. The reason why HPTS originally used inpcb is the compressed TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the associated connection is still on the HPTS ring. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39697
# 66fbc19f	07-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: pass tcpcb in the tfb_tcp_ctloutput() method instead of inpcb Just matches rest of the KPI. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39435
# 945f9a7c	07-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: misc cleanup of options for rack as well as socket option logging. Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also there is a t_maxpeak_rate that needs to be removed (its un-used). Reviewed by: tuexen, cc Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39427
# 73ee5756	31-Mar-2023	Randall Stewart <rrs@FreeBSD.org>	Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features. So stack switching as always been a bit of a issue. We currently use a break before make setup which means that if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider for sure). Once all this lands I will then update rack to begin using all these new features. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39210
# 69c7c811	16-Mar-2023	Randall Stewart <rrs@FreeBSD.org>	Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities. The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints we need to move to using the new inline functions. This adds them and moves rack to now use the tcp_tracepoints. Reviewed by: tuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D38831
# 399a5655	28-Feb-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Make TCP PCAP buffer properly configurable. Reviewed By: tuexen, cc, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D38824
# 453aa7fa	22-Feb-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp: ensure the tcpcb is not NULL when logging an event When calling tcp_bblog_pru() on some error paths, tp is NULL, therefore handle it. Sponsored by: Netflix, Inc.
# 4065becf	21-Feb-2023	Michael Tuexen <tuexen@FreeBSD.org>	bblog: unbreak build Ensure that tp is always declared and set. Reported by: Michael Butler Sponsored by: Netflix, Inc.
# 00812bbd	20-Feb-2023	Michael Tuexen <tuexen@FreeBSD.org>	bblog: add logging of protocol user requests This information was available in trpt and is useful. So provide a way to get this information via TCP BBLog. Reviewed by: rscheff@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D38701
# cda6bdba	17-Feb-2023	John Baldwin <jhb@FreeBSD.org>	tcp: Don't try to disconnect a socket multiple times. When the checks for INP_TIMEWAIT were removed, tcp_usr_close() and tcp_usr_disconnect() were no longer prevented from calling tcp_disconnect() on a socket that was already disconnected. This triggered a panic in cxgbe(4) for TOE where the tcp_disconnect() on an already-disconnected socket invoked tcp_output() on a socket that was already in time-wait. Reviewed by: rrs, np Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D37112
# 96871af0	15-Feb-2023	Gleb Smirnoff <glebius@FreeBSD.org>	inpcb: use family specific sockaddr argument for bind functions Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_bind method and from there on go down the call stack with family specific argument. Reviewed by: zlei, melifaro, markj Differential Revision: https://reviews.freebsd.org/D38601
# 636b19ea	14-Feb-2023	Mark Johnston <markj@FreeBSD.org>	tcp: Disallow re-connection of a connected socket soconnectat() tries to ensure that one cannot connect a connected socket. However, the check is racy and does not really prevent two threads from attempting to connect the same TCP socket. Modify tcp_connect() and tcp6_connect() to perform the check again, this time synchronized by the inpcb lock, under which we call soisconnecting(). Reported by: syzkaller Reviewed by: glebius MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38507
# 775da7f8	13-Feb-2023	Mark Johnston <markj@FreeBSD.org>	tcp: Remove a redundant net_epoch entry in tcp6_connect() tcp6_connect() is always called in a net_epoch read section. Fixes: 3d76be28ec60 ("netinet6: require network epoch for in6_pcbconnect()") Reviewed by: tuexen, glebius Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38506
# dfc4d218	07-Feb-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: use straight in_pcbconnect() in tcp_connect() This brings tcp_connect() par with tcp6_connect(). The code removed now is a remnant of "truncating old TIME-WAIT" removed back in 2004 in c94c54e4df9a. Reviewed by: markj, tuexen Differential Revision: https://reviews.freebsd.org/D38405
# fb8f221a	06-Feb-2023	Maxim Konovalov <maxim@FreeBSD.org>	db_printf: fix a typo PR: 269377
# 9e46ff4d	03-Feb-2023	Gleb Smirnoff <glebius@FreeBSD.org>	netinet: don't return conflicting inpcb in in_pcbconnect_setup() Last time this inpcb was actually used was in tcp_connect() before c94c54e4df9a.
# a9afe086	03-Feb-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: bring comment for tcp_connect() up to date We no longer use in_pcbbind() since 25102351509. The comment about truncating old TIME-WAIT describes a code that had been removed back in 2004 in c94c54e4df9a.
# a9d22cce	03-Feb-2023	Gleb Smirnoff <glebius@FreeBSD.org>	inpcb: use family specific sockaddr argument for connect functions Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_connect method and from there on go down the call stack with family specific argument. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D38356
# 3d76be28	03-Feb-2023	Gleb Smirnoff <glebius@FreeBSD.org>	netinet6: require network epoch for in6_pcbconnect() This removes recursive epoch entry in the syncache case. Fixes unprotected access to V_in6_ifaddrhead in in6_pcbladdr(), as well as access to prison IP address lists. It also matches what IPv4 in_pcbconnect() does. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D38355
# 221b9e3d	03-Feb-2023	Gleb Smirnoff <glebius@FreeBSD.org>	inpcb: merge two versions of in6_pcbconnect() into one No functional change. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D38354
# 76f1499f	03-Feb-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: retire net.inet.tcp.tcp_require_unique_port It was a safe belt just in case if the new port allocation behaviour introduced in 25102351509 would cause a problem. Reviewed by: markj, rscheff, tuexen Differential revision: https://reviews.freebsd.org/D38353
# 18b83b62	26-Jan-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: reduce the size of t_rttupdated in tcpcb During tcp session start, various mechanisms need to track a few initial RTTs before becoming active. Prevent overflows of the corresponding tracking counter and reduce the size of tcpcb simultaneously. Reviewed By: #transport, tuexen, guest-ccui Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D21117
# eaabc937	14-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: retire TCPDEBUG This subsystem is superseded by modern debugging facilities, e.g. DTrace probes and TCP black box logging. We intentionally leave SO_DEBUG in place, as many utilities may set it on a socket. Also the tcp::debug DTrace probes look at this flag on a socket. Reviewed by: gnn, tuexen Discussed with: rscheff, rrs, jtl Differential revision: https://reviews.freebsd.org/D37694
# 446ccdd0	07-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: use single locked callout per tcpcb for the TCP timers Use only one callout structure per tcpcb that is responsible for handling all five TCP timeouts. Use locked version of callout, of course. The callout function tcp_timer_enter() chooses soonest timer and executes it with lock held. Unless the timer reports that the tcpcb has been freed, the callout is rescheduled for next soonest timer, if there is any. With single callout per tcpcb on connection teardown we should be able to fully stop the callout and immediately free it, avoiding use of callout_async_drain(). There is one gotcha here: callout_stop() can actually touch our memory when a rare race condition happens. See comment above tcp_timer_stop(). Synchronous stop of the callout makes tcp_discardcb() the single entry point for tcpcb destructor, merging the tcp_freecb() to the end of the function. While here, also remove lots of lingering checks in the beginning of TCP timer functions. With a locked callout they are unnecessary. While here, clean unused parts of timer KPI for the pluggable TCP stacks. While here, remove TCPDEBUG from tcp_timer.c, as this allows for more simplification of TCP timers. The TCPDEBUG is scheduled for removal. Move the DTrace probes in timers to the beginning of a function, where a tcpcb is always existing. Discussed with: rrs, tuexen, rscheff (the TCP part of the diff) Reviewed by: hselasky, kib, mav (the callout part) Differential revision: https://reviews.freebsd.org/D37321
# e68b3792	07-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: embed inpcb into tcpcb For the TCP protocol inpcb storage specify allocation size that would provide space to most of the data a TCP connection needs, embedding into struct tcpcb several structures, that previously were allocated separately. The most import one is the inpcb itself. With embedding we can provide strong guarantee that with a valid TCP inpcb the tcpcb is always valid and vice versa. Also we reduce number of allocs/frees per connection. The embedded inpcb is placed in the beginning of the struct tcpcb, since in_pcballoc() requires that. However, later we may want to move it around for cache line efficiency, and this can be done with a little effort. The new intotcpcb() macro is ready for such move. The congestion algorithm data, the TCP timers and osd(9) data are also embedded into tcpcb, and temprorary struct tcpcb_mem goes away. There was no extra allocation here, but we went through extra pointer every time we accessed this data. One interesting side effect is that now TCP data is allocated from SMR-protected zone. Potentially this allows the TCP stacks or other TCP related modules to utilize that for their own synchronization. Large part of the change was done with sed script: s/tp->ccv->/tp->t_ccv./g s/tp->ccv/\&tp->t_ccv/g s/tp->cc_algo/tp->t_cc/g s/tp->t_timers->tt_/tp->tt_/g s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g Dependency side effect is that code that needs to know struct tcpcb should also know struct inpcb, that added several <netinet/in_pcb.h>. Differential revision: https://reviews.freebsd.org/D37127
# bd4f9866	16-Nov-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: remove unused t_rttbest No functional change intended. Reviewed by: rscheff@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D37401
# 9eb0e832	08-Nov-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: provide macros to access inpcb and socket from a tcpcb There should be no functional changes with this commit. Reviewed by: rscheff Differential revision: https://reviews.freebsd.org/D37123
# 22c81cc5	06-Nov-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: add AccECN CE packet counters to tcpinfo Provide diagnostics information around AccECN into the tcpinfo struct. Event: IETF 115 Hackathon Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D37280
# 53af6903	06-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove INP_TIMEWAIT flag Mechanically cleanup INP_TIMEWAIT from the kernel sources. After 0d7445193ab, this commit shall not cause any functional changes. Note: this flag was very often checked together with INP_DROPPED. If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we will be able to remove most of this checks and turn them to assertions. Some of them can be turned into assertions right now, but that should be carefully done on a case by case basis. Differential revision: https://reviews.freebsd.org/D36400
# 9c3507f9	06-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: in tcp_usr_detach() remove special handling of compressed time-wait Differential revision: https://reviews.freebsd.org/D36399
# 08af8aac	27-Sep-2022	Randall Stewart <rrs@FreeBSD.org>	Tcp progress timeout Rack has had the ability to timeout connections that just sit idle automatically. This feature of course is off by default and requires the user set it on (though the socket option has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in the base stack as well as rack. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36716
# 493105c2	21-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: fix simultaneous open and refine e80062a2d43 - The soisconnected() call on transition from SYN_RCVD to ESTABLISHED is also necessary for a half-synchronized connection. Fix that just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED. - Provide a comment that explains at what conditions the call to soisconnected() is necessary. - Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN. - Extend the change to the BBR and RACK stacks. Note: the interaction between the accept_filter(9) and the socket layer is not fully consistent, yet. For most accept filters this call to soisconnected() will not move the connection from the incomplete queue to the complete. The move would happen only when the filter has received the desired data, and soisconnected() would be called once again from sorwakeup(). Ideally, we should mark socket as connected only there, and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the simultaneous open case. However, this doesn't yet work. Reviewed by: rscheff, tuexen, rrs Differential revision: https://reviews.freebsd.org/D36641
# e80062a2	08-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: avoid call to soisconnected() on transition to ESTABLISHED This call existed since pre-FreeBSD times, and it is hard to understand why it was there in the first place. After 6f3caa6d815 it definitely became necessary always and commit message from f1ee30ccd60 confirms that. Now that 6f3caa6d815 is effectively backed out by 07285bb4c22, the call appears to be useful only for sockets that landed on the incomplete queue, e.g. sockets that have accept_filter(9) enabled on them. Provide a new TCP flag to mark connections that are known to be on the incomplete queue, and call soisconnected() only for those connections. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36488
# 0773b44e	05-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: tcp6_connect() requires net epoch PR: 262663 Reported & tested by: dch MFC after: 2 weeks
# e7d02be1	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: refactor protosw and domain static declaration and load o Assert that every protosw has pr_attach. Now this structure is only for socket protocols declarations and nothing else. o Merge struct pr_usrreqs into struct protosw. This was suggested in 1996 by wollman@ (see 7b187005d18ef), and later reiterated in 2006 by rwatson@ (see 6fbb9cf860dcd). o Make struct domain hold a variable sized array of protosw pointers. For most protocols these pointers are initialized statically. Those domains that may have loadable protocols have spacers. IPv4 and IPv6 have 8 spacers each (andre@ dff3237ee54ea). o For inetsw and inet6sw leave a comment noting that many protosw entries very likely are dead code. o Refactor pf_proto_[un]register() into protosw_[un]register(). o Isolate pr_*_notsupp() methods into uipc_domain.c Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36232
# d9f6ac88	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: retire PRU_ flags and their char names For many years only TCP debugging used them, but relatively recently TCP DTrace probes also started to use them. Move their declarations into tcp_debug.h, but start including tcp_debug.h unconditionally, so that compilation with DTrace and without TCPDEBUG is possible.
# 07285bb4	10-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: utilize new solisten_clone() and solisten_enqueue() This streamlines cloning of a socket from a listener. Now we do not drop the inpcb lock during creation of a new socket, do not do useless state transitions, and put a fully initialized socket+inpcb+tcpcb into the listen queue. Before this change, first we would allocate the socket and inpcb+tcpcb via tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock pcb and put this onto incomplete queue (see 6f3caa6d815). Then, after sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED, insert into inpcb hash, finalize initialization of tcpcb. And then, in call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call soisconnected(). This call would lock the listening socket once again with a LOR protection sequence and then we would relocate the socket onto the complete queue and only now it is ready for accept(2). Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36064
# c7a62c92	10-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	inpcb: gather v4/v6 handling code into in_pcballoc() from protocols Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36062
# 1b91978f	06-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove a condition in tcp_usr_detach() that never happens The comment from Robert Watson doubts that this condition ever happens. Our analysis confirms that. Also, we found that if you manage to create such a connection with help of some other bug, then after the "second case" code is executed, the kernel will panic in other part of the stack. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D35714
# d8596171	04-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: use only soref()/sorele() as socket reference count o Retire SS_FDREF as it is basically a debug flag on top of already existing soref()/sorele(). o Convert SS_PROTOREF into soref()/sorele(). o Change reference model for the listen queues, see below. o Make sofree() private. The correct KPI to use is only sorele(). o Make soabort() respect the model and sorele() instead of sofree(). Note on listening queues. Until now the sockets on a queue had zero reference count. And the reference were given only upon accept(2). The assumption was that there is no way to see the queued socket from anywhere except its head. This is not true, since queued sockets already have pcbs, which are linked at least into the global pcb lists. With this change we put the reference right in the sonewconn() and on accept(2) path we just hand the existing reference to the file descriptor. Differential revision: https://reviews.freebsd.org/D35679
# 74703901	04-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: use a TCP flag to check if connection has been close(2)d The flag SS_NOFDREF is a private flag of the socket layer. It also is supposed to be read with SOCK_LOCK(), which we don't own here. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D35663
# 97453e5e	23-Jun-2022	Claudio Jeker <claudio@openbsd.org>	Unlock inp when handling TCP_MD5SIG socket options Unlock the inp when handling TCP_MD5SIG socket options. tcp_ipsec_pcbctl handles locking the inp when the option is being modified. This was found by Claudio Jeker while working on the OpenBGPd port. On 14 we get a panic when trying to call getsockopt, on 13.1 the process locks up using 100% CPU. Reviewed by: rscheff (transport), tuexen MFC after: 3 days Sponsored by: Klara Inc. Differential Revision: https://reviews.freebsd.org/D35532
# b338b1fd	18-Apr-2022	Mateusz Guzik <mjg@FreeBSD.org>	tcp: plug set-but-not-used vars Sponsored by: Rubicon Communications, LLC ("Netgate")
# ea9017fb	21-Feb-2022	Randall Stewart <rrs@FreeBSD.org>	tcp: Congestion control move to using reference counting. In the transport call on 12/3 Gleb asked to move the CC modules towards using reference counting to prevent folks from unloading a module in use. It was also agreed that Michael would do a user space utility like tcp_drop that could be used to move all connections that are using a specific CC to some other CC. This is the half I committed to doing, making it so that we maintain a refcount on a cc module every time a pcb refers to it and decrementing that every time a pcb no longer uses a cc module. This also helps us simplify the whole unloading process by getting rid of tcp_ccunload() which munged through all the tcb's. Instead we mark a module as being removed and prevent further references to it. We also make sure that if a module is marked as being removed it cannot be made as the default and also the opposite of that, if its a default it fails and does not mark it as being removed. Reviewed by: Michael Tuexen, Gleb Smirnoff Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D33249
# 3f169c54	09-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Add/update AccECN related statistics and numbers Reserve counters in the tcps struct in preparation for AccECN, extend the debugging output for TF2 flags, optimize the syncache flags from individual bits to a codepoint for the specific ECN handshake. This is in preparation of AccECN. No functional change except for extended debug output capabilities. Reviewed By: #transport, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34161
# 528c7649	08-Feb-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: fix compilation when KERN_TLS is not defined Reported by: Gary Jennejohn Fixes: fd7daa727126 - main - tcp: make tcp_ctloutput_set() non-static Sponsored by: Netflix, Inc.
# fd7daa72	08-Feb-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: make tcp_ctloutput_set() non-static tcp_ctloutput_set() will be used via the sysctl interface in a upcoming command line tool tcpsso. Reviewed by: glebius, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D34164
# 3b0ee680	03-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Prevent setting of ECN bits with setsockopt() setsockopt() grants full access to the deprecated TOS byte. For TCP, mask out the ECN codepoint, so that only the DSCP portion can be adjusted. Reviewed By: tuexen, hselasky, #manpages, #transport, debdrup Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34154
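The masking described in 3b0ee680 can be illustrated with a small userland sketch. This is not the kernel code from that commit: `ECN_MASK` and `tos_strip_ecn()` are hypothetical names chosen for illustration, and the only assumption taken from outside the log is that the ECN codepoint occupies the two low-order bits of the former TOS byte (RFC 3168), leaving the upper six bits for DSCP.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: the ECN codepoint is the low two bits of the former
 * TOS byte (RFC 3168). ECN_MASK is a local, illustrative name.
 */
#define	ECN_MASK	0x03u

/*
 * Hypothetical helper: take a TOS value as an application might pass
 * it to setsockopt(IP_TOS) and return it with the ECN bits cleared,
 * so that only the DSCP portion remains adjustable.
 */
static uint8_t
tos_strip_ecn(int optval)
{
	assert(optval >= 0 && optval <= 0xff);
	return ((uint8_t)(optval & ~ECN_MASK));
}
```

With this helper, a request for TOS 0xb9 (DSCP EF plus an ECN bit) would be reduced to 0xb8, leaving the ECN field for the TCP stack to manage.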
# 3b3c08c1	02-Feb-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: cleanup functions related to socket option handling Consistently only pass the inp and the sopt around. Don't pass the so around, since in a upcoming commit tcp_ctloutput_set() will be called from a context different from setsockopt(). Also expect the inp to be locked when calling tcp_ctloutput_[gs]et(), this is also required for the upcoming use by tcpsso, a command line tool to set socket options. Reviewed by: glebius, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D34151
# aac52f94	18-Jan-2022	Randall Stewart <rrs@FreeBSD.org>	tcp: Warning cleanup from new compiler. The clang compiler recently got an update that generates warnings of unused variables where they were set, and then never used. This revision goes through the tcp stack and cleans all of those up. Reviewed by: Michael Tuexen, Gleb Smirnoff Sponsored by: Netflix Inc. Differential Revision:
# 1d41a494	13-Jan-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_usr_connect: report actual error code when stack requests drop
# 4287aa56	28-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_usr_shutdown: don't cast inp_ppcb to tcpcb before checking inp_flags While here move out one more erroneous condition out of the epoch and common return. The only functional change is that if we send control on a shut down socket we would get EINVAL instead of ECONNRESET. Reviewed by: tuexen Reported by: syzbot+8388cf7f401a7b6bece6@syzkaller.appspotmail.com Fixes: f64dc2ab5be38e5366271ef85ea90d8cb1c7841a
# 0af4ce45	27-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_usr_shutdown: don't cast inp_ppcb to tcpcb before checking inp_flags Fixes: f64dc2ab5be38e5366271ef85ea90d8cb1c7841a
# 37a7f557	27-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_usr_rcvd: don't cast inp_ppcb to tcpcb before checking inp_flags Fixes: f64dc2ab5be38e5366271ef85ea90d8cb1c7841a
# a370832b	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove delayed drop KPI No longer needed after tcp_output() can ask caller to drop. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33371
# f64dc2ab	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: TCP output method can request tcp_drop The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection when they do output on it. The default stack never does this, thus existing framework expects tcp_output() always to return locked and valid tcpcb. Provide KPI extension to satisfy demands of advanced stacks. If the output method returns negative error code, it means that caller must call tcp_drop(). In tcp_var() provide three inline methods to call tcp_output(): - tcp_output() is a drop-in replacement for the default stack, so that default stack can continue using it internally without modifications. For advanced stacks it would perform tcp_drop() and unlock and report that with negative error code. - tcp_output_unlock() handles the negative code and always converts it to positive and always unlocks. - tcp_output_nodrop() just calls the method and leaves the responsibility to drop on the caller. Sweep over the advanced stacks and use new KPI instead of using HPTS delayed drop queue for that. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33370
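The return-code convention described in f64dc2ab can be modelled in userland roughly as follows. This is a simplified sketch, not the kernel KPI: `struct conn`, `conn_drop()`, `conn_output_nodrop()`, `conn_output_unlock()` and the two sample output methods are hypothetical stand-ins for the tcpcb, tcp_drop() and the wrappers named in the commit message, and locking is reduced to a boolean.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a connection with a per-stack output method. */
struct conn {
	bool	locked;
	bool	dropped;
	int	(*output)(struct conn *);
};

/* Stand-in for dropping the connection. */
static void
conn_drop(struct conn *c)
{
	c->dropped = true;
}

/* Sample output method that succeeds (default-stack-like behavior). */
static int
out_ok(struct conn *c)
{
	(void)c;
	return (0);
}

/* Sample output method that asks the caller to drop: negative errno. */
static int
out_wants_drop(struct conn *c)
{
	(void)c;
	return (-ECONNRESET);
}

/* "nodrop" flavor: call the method, leave any drop to the caller. */
static int
conn_output_nodrop(struct conn *c)
{
	return (c->output(c));
}

/*
 * "unlock" flavor: handle a negative code by dropping, convert it to
 * a positive errno, and always return with the connection unlocked.
 */
static int
conn_output_unlock(struct conn *c)
{
	int rv;

	assert(c->locked);
	rv = c->output(c);
	if (rv < 0) {
		conn_drop(c);
		rv = -rv;
	}
	c->locked = false;
	return (rv);
}
```

The point of the split is that a caller which cannot tolerate losing the connection mid-function uses the "nodrop" flavor and inspects the sign itself, while a caller that is about to return anyway can use the "unlock" flavor and simply propagate a positive errno.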
# 40fa3e40	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: mechanically substitute call to tfb_tcp_output to new method. Made with sed(1) execution: sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/) sed: s/tp->t_fb->tfb_tcp_output\(tp\)/tcp_output(tp)/ s/to tfb_tcp_output/to tcp_output()/ Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33366
# ef396441	10-Nov-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_usr_detach: revert debugging piece from f5cf1e5f5a500. The code was probably useful during the problem being chased down, but for brevity makes sense just to return to the original KASSERT. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D32968
# df07bfda	12-Nov-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: Fix a locking issue INP_WLOCK_RECHECK_CLEANUP() and INP_WLOCK_RECHECK() might return from the function, so any locks held must be released. Reported by: syzbot+b1a888df08efaa7b4bf1@syzkaller.appspotmail.com Reviewed by: markj Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D32975
# b8d60729	11-Nov-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Congestion control cleanup. NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!! This patch does a bit of cleanup on TCP congestion control modules. There were some rather interesting surprises that one could get i.e. where you use a socket option to change from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get a memory failure and end up on cc_newreno. This is not what one would expect. The new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK and pass in to the init function preallocated memory. The CC init is expected in this case not to fail but if it does and a module does break the "no fail with memory given" contract we do fall back to the CC that was in place at the time. This also fixes up a set of common newreno utilities that can be shared amongst other CC modules instead of the other CC modules reaching into newreno and executing what they think is a "common and understood" function. Lets put these functions in cc.c and that way we have a common place that is easily findable by future developers or bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE and HYSTART++ without having to dance through hoops for other CC modules, instead both newreno and the other modules just call into the common functions if they desire that behavior or roll there own if that makes more sense. Note: This commit changes the kernel configuration!! If you are not using GENERIC in some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC, CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined as well if you desire. Note that if you create a kernel configuration that does not define a congestion control module and includes INET or INET6 the kernel compile will break. 
Also you need to define a default; GENERIC adds 'options CC_DEFAULT=\"newreno\"' but you can specify any string that represents the name of the CC module (same names that show up in the CC module list under net.inet.tcp.cc). If you fail to add the options CC_DEFAULT in your kernel configuration the kernel build will also break. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. RELNOTES:YES Differential Revision: https://reviews.freebsd.org/D32693
# f581a26e	25-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP. Pass control for IP/IP6 level options from generic tcp_ctloutput_set() down to per-stack ctloutput. Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput(). Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
# de156263	25-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	Several IP level socket options may affect TCP. After handling them in IP level ctloutput, pass them down to TCP ctloutput. We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place for now, but comment out how it should be handled. For IPv4 we are interested in IP_TOS and IP_TTL. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
# fc4d53cc	25-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	Split tcp_ctloutput() into set/get parts. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
# e2833083	25-Oct-2021	Peter Lei <peterlei@netflix.com>	tcp: socket option to get stack alias name TCP stack sysctl nodes are currently inserted using the stack name alias. Allow the user to get the current stack's alias to allow for programmatic sysctl access. Obtained from: Netflix
# bf256782	16-Sep-2021	Mark Johnston <markj@FreeBSD.org>	ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so errors are effectively hidden. Fix this by using a separate output parameter. Convert to the new socket buffer locking macros while here. Note that the socket buffer lock is not needed to synchronize the SOLISTENING check here, we can rely on the PCB lock. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31977
# bd4a39cc	07-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Properly interlock when transitioning to a listening socket Currently, most protocols implement pru_listen with something like the following: SOCK_LOCK(so); error = solisten_proto_check(so); if (error) { SOCK_UNLOCK(so); return (error); } solisten_proto(so); SOCK_UNLOCK(so); solisten_proto_check() fails if the socket is connected or connecting. However, the socket lock is not used during I/O, so this pattern is racy. The change modifies solisten_proto_check() to additionally acquire socket buffer locks, and the calling thread holds them until solisten_proto() or solisten_proto_abort() is called. Now that the socket buffer locks are preserved across a listen(2), this change allows socket I/O paths to properly interlock with listen(2). This fixes a large number of syzbot reports, only one is listed below and the rest will be dup'ed to it. Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31659
# 3f1f6b6e	05-Aug-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp, udp: improve input validation in handling bind() Reported by: syzbot+24fcfd8057e9bc339295@syzkaller.appspotmail.com Reported by: syzbot+6e90ceb5c89285b2655b@syzkaller.appspotmail.com Reviewed by: markj, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D31422
# 4747500d	04-Jun-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: A better fix for the previously attempted fix of the ack-war issue with tcp. So it turns out that my fix before was not correct. It ended with us failing some of the "improved" SYN tests, since we are not in the correct states. With more digging I have figured out the root of the problem is that when we receive a SYN|FIN the reassembly code made it so we create a segq entry to hold the FIN. In the established state where we were not in order this would be correct i.e. a 0 len with a FIN would need to be accepted. But if you are in a front state we need to strip the FIN so we correctly handle the ACK but ignore the FIN. This gets us into the proper states and avoids the previous ack war. I back out some of the previous changes but then add a new change here in tcp_reass() that fixes the root cause of the issue. We still leave the rack panic fixes in place however. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30627
# f96603b5	31-May-2021	Mark Johnston <markj@FreeBSD.org>	tcp, udp: Permit binding with AF_UNSPEC if the address is INADDR_ANY Prior to commit f161d294b we only checked the sockaddr length, but now we verify the address family as well. This breaks at least ttcp. Relax the check to avoid breaking compatibility too much: permit AF_UNSPEC if the address is INADDR_ANY. Fixes: f161d294b Reported by: Bakul Shah <bakul@iitbombay.org> Reviewed by: tuexen MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30539
# 086a3556	25-May-2021	Andrew Gallatin <gallatin@FreeBSD.org>	tcp: enter network epoch when calling tfb_tcp_fb_fini We need to enter the network epoch when calling into tfb_tcp_fb_fini. I noticed this when I hit an assert running the latest rack Differential Revision: https://reviews.freebsd.org/D30407 Reviewed by: rrs, tuexen Sponsored by: Netflix
# 13c0e198	25-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Fix bugs related to the PUSH bit and rack and an ack war Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed in the last commit. The problem is the left edge gets transmitted before the adjustments are done to the send_map, this means that right edge bits must be considered to be added only if the entire RSM is being retransmitted. Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself. After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically what happens is we go into the reassembly code and lose the FIN bit. The trick here is we should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything. That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no timers running. This is because the usrclosed function gets called and the FIN's and such have already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2 timer can speed this up in testing. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30451
# 7d2608a5	21-May-2021	Mark Johnston <markj@FreeBSD.org>	tcp: Make error handling in tcp_usr_send() more consistent - Free the input mbuf in a single place instead of in every error path. - Handle PRUS_NOTREADY consistently. - Flush the socket's send buffer if an implicit connect fails. At that point the mbuf has already been enqueued but we don't want to keep it in the send buffer. Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349
# d8acd268	12-May-2021	Mark Johnston <markj@FreeBSD.org>	Fix mbuf leaks in various pru_send implementations The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed. This diff plugs various leaks in the pru_send implementations. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151
# 0471a8c7	10-May-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: SACK Lost Retransmission Detection (LRD) Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1 Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931
# f161d294	02-May-2021	Mark Johnston <markj@FreeBSD.org>	Add missing sockaddr length and family validation to various protocols Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were validated after some out-of-bounds accesses could occur. Add requisite checking to various protocol entry points, and convert some existing checks to assertions where appropriate. Reported by: syzkaller+KASAN Reviewed by: tuexen, melifaro MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29519
# 9e644c23	18-Apr-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: add support for TCP over UDP Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special privileges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469
# 4c0bef07	20-Jan-2021	Kyle Evans <kevans@FreeBSD.org>	kern: net: remove TCP_LINGERTIME TCP_LINGERTIME can be traced back to BSD 4.4 Lite and perhaps beyond, in exactly the same form that it appears here modulo slightly different context. It used to be the case that there was a single pr_usrreq method with requests dispatched to it; these exact two lines appeared in tcp_usrreq's PRU_ATTACH handling. The only purpose of this that I can find is to cause surprising behavior on accepted connections. Newly-created sockets will never hit these paths as one cannot set SO_LINGER prior to socket(2). If SO_LINGER is set on a listening socket and inherited, one would expect the timeout to be inherited rather than changed arbitrarily like this -- noting that SO_LINGER is nonsense on a listening socket beyond inheritance, since they cannot be 'connected' by definition. Neither Illumos nor Linux reset the timer like this based on testing and inspection of Illumos, and testing of Linux. Reviewed by: rscheff, tuexen Differential Revision: https://reviews.freebsd.org/D28265
# a034518a	19-Dec-2020	Andrew Gallatin <gallatin@FreeBSD.org>	Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, it will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21636
# 662c1305	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	net: clean up empty lines in .c and .h files
# f903a308	16-Jul-2020	Michael Tuexen <tuexen@FreeBSD.org>	(Re)-allow 0.0.0.0 to be used as an address in connect() for TCP In r361752 an error handling was introduced for using 0.0.0.0 or 255.255.255.255 as the address in connect() for TCP, since both addresses can't be used. However, the stack maps 0.0.0.0 implicitly to a local address and at least two regressions were reported. Therefore, re-allow the usage of 0.0.0.0. While there, change the error indicated when using 255.255.255.255 from EAFNOSUPPORT to EACCES as mentioned in the man-page of connect(). Reviewed by: rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D25401
# e854dd38	08-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	An important statistic in determining if a server process (or client) is being delayed is to know the time to first byte in and time to first byte out. Currently we have no way to know these all we have is t_starttime. That (t_starttime) tells us what time the 3 way handshake completed. We don't know when the first request came in or how quickly we responded. Nor from a client perspective do we know how long from when we sent out the first byte before the server responded. This small change adds the ability to track the TTFB's. This will show up in BB logging which then can be pulled for later analysis. Note that currently the tracking is via the ticks variable of all three variables. This provides a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be to change all three of these values into something with a much finer resolution (either microseconds or nanoseconds), though we may want to make the resolution configurable so that on lower powered machines we could still use the much cheaper ticks variable. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24902
# 2cf21ae5	03-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	We should never allow either the broadcast or IN_ADDR_ANY to be connected to or sent to. This was found when working with Michael Tuexen and syzkaller. Syzkaller seems to want to use either of these two addresses to connect to at times. And it really is an error to do so, so let's not allow that behavior. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24852
# d442a657	03-Jun-2020	Michael Tuexen <tuexen@FreeBSD.org>	Restrict enabling TCP-FASTOPEN to end-points in CLOSED or LISTEN state Enabling TCP-FASTOPEN on an end-point which is in a state other than CLOSED or LISTEN, is a bug in the application. So it should not work. Also the TCP code does not (and needs not to) handle this. While there, also simplify the setting of the TF_FASTOPEN flag. This issue was found by running syzkaller. Reviewed by: rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D25115
# 25102351	18-May-2020	Mike Karels <karels@FreeBSD.org>	Allow TCP to reuse local port with different destinations Previously, tcp_connect() would bind a local port before connecting, forcing the local port to be unique across all outgoing TCP connections for the address family. Instead, choose a local port after selecting the destination and the local address, requiring only that the tuple is unique and does not match a wildcard binding. Reviewed by: tuexen (rscheff, rrs previous version) MFC after: 1 month Sponsored by: Forcepoint LLC Differential Revision: https://reviews.freebsd.org/D24781
# e240ce42	15-May-2020	Michael Tuexen <tuexen@FreeBSD.org>	Allow only IPv4 addresses in sendto() for TCP on AF_INET sockets. This problem was found by looking at syzkaller reproducers for some other problems. Reviewed by: rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24831
# 9d176904	10-May-2020	Michael Tuexen <tuexen@FreeBSD.org>	Remove trailing whitespace.
# d3b6c96b	04-May-2020	Randall Stewart <rrs@FreeBSD.org>	Adjust the fb to have a way to ask the underlying stack if it can support the PRUS option (OOB). And then have the new function call that to validate and give the correct error response if needed to the user (rack and bbr do not support obsoleted OOB data). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24574
# f1f93475	27-Apr-2020	John Baldwin <jhb@FreeBSD.org>	Initial support for kernel offload of TLS receive. - Add a new TCP_RXTLS_ENABLE socket option to set the encryption and authentication algorithms and keys as well as the initial sequence number. - When reading from a socket using KTLS receive, applications must use recvmsg(). Each successful call to recvmsg() will return a single TLS record. A new TCP control message, TLS_GET_RECORD, will contain the TLS record header of the decrypted record. The regular message buffer passed to recvmsg() will receive the decrypted payload. This is similar to the interface used by Linux's KTLS RX except that Linux does not return the full TLS header in the control message. - Add plumbing to the TOE KTLS interface to request either transmit or receive KTLS sessions. - When a socket is using receive KTLS, redirect reads from soreceive_stream() into soreceive_generic(). - Note that this interface is currently only defined for TLS 1.1 and 1.2, though I believe we will be able to reuse the same interface and structures for 1.3.
# ec1db6e1	27-Apr-2020	John Baldwin <jhb@FreeBSD.org>	Add the initial sequence number to the TLS enable socket option. This will be needed for KTLS RX. Reviewed by: gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24451
# a3574665	13-Feb-2020	Michael Tuexen <tuexen@FreeBSD.org>	sack_newdata and snd_recover hold the same value. Therefore, use only a single instance: use snd_recover also where sack_newdata was used. Submitted by: Richard Scheffenegger Differential Revision: https://reviews.freebsd.org/D18811
# 481be5de	12-Feb-2020	Randall Stewart <rrs@FreeBSD.org>	White space cleanup -- remove trailing tab's or spaces from any line. Sponsored by: Netflix Inc.
# 42ce7937	29-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Fix missing NET_EPOCH_ENTER() when compiled with TCP_OFFLOAD. Reported by: Coverity CID: 1413162
# 7754e281	22-Jan-2020	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix NOINET kernels after r356983. All gotos to the label are within the #ifdef INET section, which leaves us with an unused label. Cover the label under #ifdef INET as well to avoid the warning and compile time error.
# c1604fe4	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Make in_pcbladdr() require network epoch entered by its callers. Together with this widen network epoch coverage up to tcp_connect() and udp_connect(). Revisions from r356974 and up to this revision cover D23187. Differential Revision: https://reviews.freebsd.org/D23187
# e2636f0a	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Remove extraneous NET_EPOCH_ASSERT - the full function is covered.
# 3fed74e9	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Re-absorb tcp_detach() back into tcp_usr_detach() as the comment suggests. Not a functional change.
# 5fc8df3c	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Don't enter network epoch in tcp_usr_detach. A PCB removal doesn't require that.
# 7669c586	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_usr_attach() doesn't need network epoch. in_pcbfree() and in_pcbdetach() perform all necessary synchronization themselves.
# 0f6385e7	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Inline tcp_attach() into tcp_usr_attach(). Not a functional change.
# 109eb549	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Make tcp_output() require network epoch. Enter the epoch before calling into tcp_output() from those functions, that didn't do that before. This eliminates a bunch of epoch recursions in TCP.
# adc56f5a	02-Dec-2019	Edward Tomasz Napierala <trasz@FreeBSD.org>	Make use of the stats(3) framework in the TCP stack. This makes it possible to retrieve per-connection statistical information such as the receive window size, RTT, or goodput, using a newly added TCP_STATS getsockopt(3) option, and extract them using the stats_voistat_fetch(3) API. See the net/tcprtt port for an example consumer of this API. Compared to the existing TCP_INFO system, the main differences are that this mechanism is easy to extend without breaking ABI, and provides statistical information instead of raw "snapshots" of values at a given point in time. stats(3) is more generic and can be used in both userland and the kernel. Reviewed by: thj Tested by: thj Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20655
# 3cf38784	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	Move all ECN related flags from the flags to the flags2 field. This allows adding more ECN related flags in the future. No functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22497
# 97a95ee1	06-Nov-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP functions that are executed in syscall context. No functional change here.
# 4a91aa8f	24-Oct-2019	Michael Tuexen <tuexen@FreeBSD.org>	Ensure that the flags indicating IPv4/IPv6 are not changed by failing bind() calls. This would lead to inconsistent state resulting in a panic. A fix for stable/11 was committed in https://svnweb.freebsd.org/base?view=revision&revision=338986 An accelerated MFC is planned as discussed with emaste@. Reported by: syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com Obtained from: jtl@ MFC after: 1 day Sponsored by: Netflix, Inc.
# 9e14430d	08-Oct-2019	John Baldwin <jhb@FreeBSD.org>	Add a TOE KTLS mode and a TOE hook for allocating TLS sessions. This adds the glue to allocate TLS sessions and invokes it from the TLS enable socket option handler. This also adds some counters for active TOE sessions. The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when TOE KTLS is in use on a socket, but cannot be set via setsockopt(). To simplify various checks, a TLS session now includes an explicit 'mode' member set to the value returned by TLSTX_TLS_MODE. Various places that used to check 'sw_encrypt' against NULL to determine software vs ifnet (NIC) TLS now check 'mode' instead. Reviewed by: np, gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21891
# b2e60773	26-Aug-2019	John Baldwin <jhb@FreeBSD.org>	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). 
A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. 
If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277
# 0ecd976e	02-Aug-2019	Bjoern A. Zeeb <bz@FreeBSD.org>	IPv6 cleanup: kernel Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now, so it is feasible to make the change for everything remaining without causing too much disturbance. Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD. * Rename sin6p to sin6 as that is how it is spelt in most places. * Remove "aliases" (#defines) for: - in6pcb which really is an inpcb and nothing separate - sotoin6pcb which is sotoinpcb (as per above) - in6p_sp which is inp_sp - in6p_flowinfo which is inp_flow * Try to use ia6 for in6_addr rather than in6p. * With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6. The reasons behind this cleanup are that we try to further unify netinet and netinet6 code where possible and that people will be less likely to ignore one or the other protocol family when making code changes, as they may not have spotted places due to different names for the same thing. No functional changes. Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix
# 82334850	28-Jun-2019	John Baldwin <jhb@FreeBSD.org>	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616
# 68cea2b1	18-Apr-2019	John Baldwin <jhb@FreeBSD.org>	Push down INP_WLOCK slightly in tcp_ctloutput. The inp lock is not needed for testing the V6 flag as that flag is set once when the inp is created and never changes. For non-TCP socket options the lock is immediately dropped after checking that flag. This just pushes the lock down to only be acquired for TCP socket options. This isn't a hot-path, more a cosmetic cleanup I noticed while reading the code. Reviewed by: bz MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19740
# c8b53ced	30-Nov-2018	Michael Tuexen <tuexen@FreeBSD.org>	Limit option_len for the TCP_CCALGOOPT. Limiting the length to 2048 bytes seems to be acceptable, since the values used right now are using 8 bytes. Reviewed by: glebius, bz, rrs MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18366
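The length limit described in c8b53ced amounts to a bounds check before the option value is copied in. A minimal sketch, assuming a limit macro named CC_ALGOOPT_LIMIT and a hypothetical helper check_ccalgoopt_len(); it illustrates the idea and is not the kernel's code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* 2048 is the limit named in the commit message; the macro name is an assumption. */
#define CC_ALGOOPT_LIMIT 2048

/*
 * Hypothetical helper: reject oversized TCP_CCALGOOPT values before any
 * buffer is allocated or copied. Returns 0 if acceptable, EINVAL otherwise.
 */
int
check_ccalgoopt_len(size_t optlen)
{
	if (optlen > CC_ALGOOPT_LIMIT)
		return (EINVAL);
	return (0);
}
```

The 8-byte values mentioned in the commit pass this check, while any request larger than 2048 bytes is refused up front.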
# c6c0be27	24-Aug-2018	Michael Tuexen <tuexen@FreeBSD.org>	Fix a shadowed variable warning. Thanks to Peter Lei for reporting the issue. Approved by: re(kib@) MFH: 1 month Sponsored by: Netflix, Inc.
# 5dff1c38	21-Aug-2018	Michael Tuexen <tuexen@FreeBSD.org>	Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP socket resulted in sending fragmented IPv6 packets. This is fixed by reducing the MSS to the appropriate value. In addition, if the socket option is set before the handshake happens, announce this MSS to the peer. This is not strictly required, but done since TCP is conservative. PR: 173444 Reviewed by: bz@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16796
# c28440db	19-Aug-2018	Randall Stewart <rrs@FreeBSD.org>	This change represents a substantial restructure of the way we reassemble inbound TCP segments. The old algorithm just blindly dropped in segments without coalescing. This meant that every segment could take up greater and greater room on the linked list of segments. This of course is now subject to a tighter limit (100) of segments which in a high BDP situation will cause us to be a lot more inefficient as we drop segments beyond 100 entries that we receive. What this restructure does is cause the reassembly buffer to coalesce segments, putting an emphasis on the two common cases (which avoid walking the list of segments), i.e. where we add to the back of the queue of segments and where we add to the front. We also have the reassembly buffer supporting a couple of debug options (black box logging as well as counters for code coverage). These are compiled out by default but can be added by uncommenting the defines. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16626
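The coalescing idea in c28440db can be sketched with a small sorted array of sequence ranges. This is a simplified illustration only: the kernel works on mbuf chains in a linked list, and the names struct seg, struct reassq and reass_insert() are hypothetical. Sequence-number wraparound is ignored here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REASS_MAX 100	/* segment limit mentioned in the commit message */

struct seg {
	uint32_t start;	/* first sequence number covered */
	uint32_t end;	/* one past the last sequence number covered */
};

struct reassq {
	struct seg s[REASS_MAX];	/* sorted by start, never overlapping */
	int n;				/* number of entries in use */
};

/* Insert [start, end) and coalesce; returns 0, or -1 if the queue is full. */
int
reass_insert(struct reassq *q, uint32_t start, uint32_t end)
{
	int i, j;

	assert(start < end);
	/* Fast path 1: the segment continues the tail of the queue. */
	if (q->n > 0 && start == q->s[q->n - 1].end) {
		q->s[q->n - 1].end = end;
		return (0);
	}
	/* Fast path 2: the segment ends exactly where the head begins. */
	if (q->n > 0 && end == q->s[0].start) {
		q->s[0].start = start;
		return (0);
	}
	/* Slow path: skip entries that end before the new segment starts. */
	for (i = 0; i < q->n && q->s[i].end < start; i++)
		;
	/* Absorb every entry that overlaps or touches the new segment. */
	for (j = i; j < q->n && q->s[j].start <= end; j++) {
		if (q->s[j].start < start)
			start = q->s[j].start;
		if (q->s[j].end > end)
			end = q->s[j].end;
	}
	if (j == i) {
		/* Nothing absorbed: open a slot at position i. */
		if (q->n == REASS_MAX)
			return (-1);
		memmove(&q->s[i + 1], &q->s[i],
		    (size_t)(q->n - i) * sizeof(q->s[0]));
		q->n++;
	} else {
		/* Entries i..j-1 collapse into one: close the gap. */
		memmove(&q->s[i + 1], &q->s[j],
		    (size_t)(q->n - j) * sizeof(q->s[0]));
		q->n -= j - i - 1;
	}
	q->s[i].start = start;
	q->s[i].end = end;
	return (0);
}
```

In-order arrivals hit the first fast path and never grow the queue, which is why coalescing keeps the entry count well below the limit in the common case.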
# 8e02b4e0	19-Aug-2018	Michael Tuexen <tuexen@FreeBSD.org>	Don't expose the uptime via the TCP timestamps. The TCP client side or the TCP server side when not using SYN-cookies used the uptime as the TCP timestamp value. This patch uses in all cases an offset, which is the result of a keyed hash function taking the source and destination addresses and port numbers into account. The keyed hash function is the same as used for the initial TSN. Reviewed by: rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16636
# 51e08d53	31-Jul-2018	Michael Tuexen <tuexen@FreeBSD.org>	Fix INET only builds. r336940 introduced an "unused variable" warning on platforms which support INET, but not INET6, like MALTA and MALTA64 as reported by Mark Millard. Improve the #ifdefs to address this issue. Sponsored by: Netflix, Inc.
# 888973f5	30-Jul-2018	Michael Tuexen <tuexen@FreeBSD.org>	Allow implicit TCP connection setup for TCP/IPv6. TCP/IPv4 allows an implicit connection setup using sendto(), which is used for TTCP and TCP fast open. This patch adds support for TCP/IPv6. While there, improve some tests for detecting multicast addresses, which are mapped. Reviewed by: bz@, kbowling@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16458
# 8db239dc	30-Jul-2018	Michael Tuexen <tuexen@FreeBSD.org>	Fix some TCP fast open issues. The following issues are fixed: * Whenever a TCP server with TCP fast open enabled, calls accept(), recv(), send(), and close() before the TCP-ACK segment has been received, the TCP connection is just dropped and the reception of the TCP-ACK segment triggers the sending of a TCP-RST segment. * Whenever a TCP server with TCP fast open enabled, calls accept(), recv(), send(), send(), and close() before the TCP-ACK segment has been received, the first byte provided in the second send call is not transferred. * Whenever a TCP client with TCP fast open enabled calls sendto() followed by close() the TCP connection is just dropped. Reviewed by: jtl@, kbowling@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16485
# 22699887	21-Jul-2018	Matt Macy <mmacy@FreeBSD.org>	NULL out cc_data in pluggable TCP {cc}_cb_destroy When ABE was added (rS331214) to NewReno and a leak fixed (rS333699), NewReno gained a destructor (newreno_cb_destroy) for per-connection state. Other congestion controls may allocate and free cc_data on entry and exit, but the field is never explicitly NULLed when moving back to NewReno, which only internally allocates stateful data (no entry constructor), resulting in a situation where newreno_cb_destroy might be called on a junk pointer. - NULL out cc_data in the framework after calling {cc}_cb_destroy - free(9) checks for NULL so there is no need to perform not-NULL checks before calling free. - Improve a comment about NewReno in tcp_ccalgounload This is the result of a debugging session from Jason Wolfe, Jason Eggleston, and mmacy@ and very helpful insight from lstewart@. Submitted by: Kevin Bowling Reviewed by: lstewart Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16282
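The fix in 22699887 follows a common pattern: the framework, not each algorithm, clears the pointer after the destructor runs. A minimal userland sketch, assuming a hypothetical struct cc_state and using free(3) in place of the kernel's free(9):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-connection congestion control state. */
struct cc_state {
	void *cc_data;
};

typedef void (*cb_destroy_fn)(struct cc_state *);

/* An algorithm's destructor: free(NULL) is a no-op, so no NULL check is needed. */
void
algo_cb_destroy(struct cc_state *cs)
{
	free(cs->cc_data);
}

/*
 * Framework wrapper: call the algorithm's destructor, then clear the
 * pointer so a later destructor never sees a stale (junk) value.
 */
void
framework_cb_destroy(struct cc_state *cs, cb_destroy_fn destroy)
{
	assert(cs != NULL);
	if (destroy != NULL)
		destroy(cs);
	cs->cc_data = NULL;
}
```

Because the pointer is cleared centrally, calling the wrapper a second time is harmless.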
# 6573d758	03-Jul-2018	Matt Macy <mmacy@FreeBSD.org>	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
# fd389e7c	19-Apr-2018	Randall Stewart <rrs@FreeBSD.org>	These two modules need the tcp_hpts.h file for when the option is enabled (not sure how LINT/build-universe missed this) opps. Sponsored by: Netflix Inc
# 3ee9c3c4	19-Apr-2018	Randall Stewart <rrs@FreeBSD.org>	This commit brings in the TCP high precision timer system (tcp_hpts). It is the forerunner/foundational work of bringing in both Rack and BBR which use hpts for pacing out packets. The feature is optional and requires the TCPHPTS option to be enabled before the feature will be active. TCP modules that use it must ensure that the base component is compiled into the kernel in which they are loaded. MFC after: Never Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15020
# 8fa799bd	06-Apr-2018	Jonathan T. Looney <jtl@FreeBSD.org>	If a user closes the socket before we call tcp_usr_abort(), then tcp_drop() may unlock the INP. Currently, tcp_usr_abort() does not check for this case, which results in a panic while trying to unlock the already-unlocked INP (not to mention, a use-after-free violation). Make tcp_usr_abort() check the return value of tcp_drop(). In the case where tcp_drop() returns NULL, tcp_usr_abort() can skip further steps to abort the connection and simply unlock the INP_INFO lock prior to returning. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Netflix, Inc.
# c73b6f4d	04-Apr-2018	Ed Maste <emaste@FreeBSD.org>	Fix kernel memory disclosure in tcp_ctloutput strcpy was used to copy a string into a buffer copied to userland, which left uninitialized data after the terminating 0-byte. Use the same approach as in tcp_subr.c: strncpy and explicit '\0'. admbugs: 765, 822 MFC after: 1 day Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> Reported by: Vlad Tsyrklevich Security: Kernel memory disclosure Sponsored by: The FreeBSD Foundation
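The disclosure fixed in c73b6f4d comes from strcpy() writing only up to the terminating NUL, so the rest of a buffer that is later copied out in full keeps whatever it held before. strncpy() zero-pads the remainder. A small userland sketch of both patterns; the buffer size NAME_BUF_LEN and the helper names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_BUF_LEN 16	/* hypothetical size, for illustration only */

/* Leaky pattern: bytes after the terminating NUL keep their old contents. */
void
fill_name_unsafe(char buf[NAME_BUF_LEN], const char *name)
{
	strcpy(buf, name);
}

/* Pattern named in the commit: strncpy() zero-pads, then terminate explicitly. */
void
fill_name_safe(char buf[NAME_BUF_LEN], const char *name)
{
	strncpy(buf, name, NAME_BUF_LEN - 1);
	buf[NAME_BUF_LEN - 1] = '\0';
}

/* Count non-zero bytes after the first NUL, i.e. bytes that would leak. */
size_t
stale_bytes(const char buf[NAME_BUF_LEN])
{
	size_t i, leaked = 0;

	for (i = 0; i < NAME_BUF_LEN && buf[i] != '\0'; i++)
		;
	for (; i < NAME_BUF_LEN; i++)
		if (buf[i] != '\0')
			leaked++;
	return (leaked);
}
```

Filling the buffer with a marker byte before each call makes the difference visible: the strcpy() variant leaves marker bytes behind the string, while the strncpy() variant leaves none.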
# a6456410	02-Apr-2018	Navdeep Parhar <np@FreeBSD.org>	Add a hook to allow the toedev handling an offloaded connection to provide accurate TCP_INFO. Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D14816
# e24e5683	23-Mar-2018	Jonathan T. Looney <jtl@FreeBSD.org>	Make the TCP blackbox code committed in r331347 be an optional feature controlled by the TCP_BLACKBOX option. Enable this as part of amd64 GENERIC. For now, leave it disabled on other platforms. Sponsored by: Netflix, Inc.
# 2529f56e	22-Mar-2018	Jonathan T. Looney <jtl@FreeBSD.org>	Add the "TCP Blackbox Recorder" which we discussed at the developer summits at BSDCan and BSDCam in 2017. The TCP Blackbox Recorder allows you to capture events on a TCP connection in a ring buffer. It stores metadata with the event. It optionally stores the TCP header associated with an event (if the event is associated with a packet) and also optionally stores information on the sockets. It supports setting a log ID on a TCP connection and using this to correlate multiple connections that share a common log ID. You can log connections in different modes. If you are doing a coordinated test with a particular connection, you may tell the system to put it in mode 4 (continuous dump). Or, if you just want to monitor for errors, you can put it in mode 1 (ring buffer) and dump all the ring buffers associated with the connection ID when we receive an error signal for that connection ID. You can set a default mode that will be applied to a particular ratio of incoming connections. You can also manually set a mode using a socket option. This commit includes only basic probes. rrs@ has added quite an abundance of probes in his TCP development work. He plans to commit those soon. There are user-space programs which we plan to commit as ports. These read the data from the log device and output pcapng files, and then let you analyze the data (and metadata) in the pcapng files. Reviewed by: gnn (previous version) Obtained from: Netflix, Inc. Relnotes: yes Differential Revision: https://reviews.freebsd.org/D11085
# dd388cfd	21-Mar-2018	Gleb Smirnoff <glebius@FreeBSD.org>	The net.inet.tcp.nolocaltimewait=1 optimization prevents local TCP connections from entering the TIME_WAIT state. However, it omits sending the ACK for the FIN, which results in RST. This becomes a bigger deal if the sysctl net.inet.tcp.blackhole is 2. In this case RST isn't sent, so the other side of the connection (also local) keeps retransmitting FINs. To fix that in tcp_twstart() we will not call tcp_close() immediately. Instead we will allocate a tcptw on stack and proceed to the end of the function all the way to tcp_twrespond(), to generate the correct ACK, then we will drop the last PCB reference. While here, make a few tiny improvements: - use bools for boolean variables - staticize nolocaltimewait - remove pointless acquisition of socket lock Reported by: jtl Reviewed by: jtl Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14697
# 18a75309	25-Feb-2018	Patrick Kelsey <pkelsey@FreeBSD.org>	Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option. The conditional compilation support is now centralized in tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum theoretical code/data footprint when TCP_RFC7413 is disabled, but nearly all the TFO code should wind up being removed by the optimizer, the additional footprint in the syncache entries is a single pointer, and the additional overhead in the tcpcb is at the end of the structure. This enables the TCP_RFC7413 kernel option by default in amd64 and arm64 GENERIC. Reviewed by: hiren MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14048
# c560df6f	25-Feb-2018	Patrick Kelsey <pkelsey@FreeBSD.org>	This is an implementation of the client side of TCP Fast Open (TFO) [RFC7413]. It also includes a pre-shared key mode of operation in which the server requires the client to be in possession of a shared secret in order to successfully open TFO connections with that server. The names of some existing fastopen sysctls have changed (e.g., net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable). Reviewed by: tuexen MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14047
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 63ec505a	18-Aug-2017	Michael Tuexen <tuexen@FreeBSD.org>	Ensure inp_vflag is consistently set for TCP endpoints. Make sure that the flags INP_IPV4 and INP_IPV6 are consistently set for inpcbs used for TCP sockets, no matter if the setting is derived from the net.inet6.ip6.v6only sysctl or the IPV6_V6ONLY socket option. For UDP this was already done right. PR: 221385 MFC after: 1 week
# 5dba6ada	22-May-2017	Michael Tuexen <tuexen@FreeBSD.org>	The connect() system call should return -1 and set errno to EAFNOSUPPORT if it is called on a TCP socket * with an IPv6 address and the socket is bound to an IPv4-mapped IPv6 address. * with an IPv4-mapped IPv6 address and the socket is bound to an IPv6 address. Thanks to Jonathan T. Leighton for reporting this issue. Reviewed by: bz gnn MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D9163
# fbbd9655	28-Feb-2017	Warner Losh <imp@FreeBSD.org>	Renumber copyright clause 4 Renumber clause 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistence on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
# 4616026f	09-Feb-2017	Ermal Luçi <eri@FreeBSD.org>	Revert r313527 Heh svn is not git
# c0fadfdb	09-Feb-2017	Ermal Luçi <eri@FreeBSD.org>	Correct missed variable name. Reported-by: ohartmann@walstatt.org
# fcf59617	06-Feb-2017	Andrey V. Elsukov <ae@FreeBSD.org>	Merge projects/ipsec into head/. Small summary ------------- o Almost all IPsec releated code was moved into sys/netipsec. o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel option IPSEC_SUPPORT added. It enables support for loading and unloading of ipsec.ko and tcpmd5.ko kernel modules. o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type support was removed. Added TCP/UDP checksum handling for inbound packets that were decapsulated by transport mode SAs. setkey(8) modified to show run-time NAT-T configuration of SA. o New network pseudo interface if_ipsec(4) added. For now it is build as part of ipsec.ko module (or with IPSEC kernel). It implements IPsec virtual tunnels to create route-based VPNs. o The network stack now invokes IPsec functions using special methods. The only one header file <netipsec/ipsec_support.h> should be included to declare all the needed things to work with IPsec. o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed. Now these protocols are handled directly via IPsec methods. o TCP_SIGNATURE support was reworked to be more close to RFC. o PF_KEY SADB was reworked: - now all security associations stored in the single SPI namespace, and all SAs MUST have unique SPI. - several hash tables added to speed up lookups in SADB. - SADB now uses rmlock to protect access, and concurrent threads can do SA lookups in the same time. - many PF_KEY message handlers were reworked to reflect changes in SADB. - SADB_UPDATE message was extended to support new PF_KEY headers: SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They can be used by IKE daemon to change SA addresses. o ipsecrequest and secpolicy structures were cardinally changed to avoid locking protection for ipsecrequest. Now we support only limited number (4) of bundled SAs, but they are supported for both INET and INET6. 
o INPCB security policy cache was introduced. Each PCB now caches used security policies to avoid SP lookup for each packet. o For inbound security policies added the mode, when the kernel does check for full history of applied IPsec transforms. o References counting rules for security policies and security associations were changed. The proper SA locking added into xform code. o xform code was also changed. Now it is possible to unregister xforms. tdb_xxx structures were changed and renamed to reflect changes in SADB/SPDB, and changed rules for locking and refcounting. Reviewed by: gnn, wblock Obtained from: Yandex LLC Relnotes: yes Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D9352
# f5cf1e5f	18-Oct-2016	Julien Charbon <jch@FreeBSD.org>	Fix a double-free when an inp transitions to INP_TIMEWAIT state after having been dropped. This fix enforces in_pcbdrop() logic in tcp_input(): "in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet delivery or event notification when a socket remains open but TCP has closed." PR: 203175 Reported by: Palle Girgensohn, Slawa Olhovchenkov Tested by: Slawa Olhovchenkov Reviewed by: Slawa Olhovchenkov Approved by: gnn, Slawa Olhovchenkov Differential Revision: https://reviews.freebsd.org/D8211 MFC after: 1 week Sponsored by: Verisign, inc
# 68bd7ed1	12-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	The TFO server-side code contains some changes that are not conditioned on the TCP_RFC7413 kernel option. This change removes those few instructions from the packet processing path. While not strictly necessary, for the sake of consistency, I applied the new IS_FASTOPEN macro to all places in the packet processing path that used the (t_flags & TF_FASTOPEN) check. Reviewed by: hiren Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8219
# 3ac12506	06-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	Remove "long" variables from the TCP stack (not including the modular congestion control framework). Reviewed by: gnn, lstewart (partial) Sponsored by: Juniper Networks, Netflix Differential Revision: (multiple) Tested by: Limelight, Netflix
# 5a17b6ad	14-Sep-2016	Michael Tuexen <tuexen@FreeBSD.org>	Ensure that the IPPROTO_TCP level socket options * TCP_KEEPINIT * TCP_KEEPINTVL * TCP_KEEPIDLE * TCP_KEEPCNT always report the values currently used when getsockopt() is used. This wasn't the case when the sysctl-inherited default values were used. Ensure that the IPPROTO_TCP level socket option TCP_INFO has the TCPI_OPT_ECN flag set in the tcpi_options field when ECN support has been negotiated successfully. Reviewed by: rrs, jtl, hiren MFC after: 1 month Differential Revision: 7833
# 587d67c0	16-Aug-2016	Randall Stewart <rrs@FreeBSD.org>	Here we update the modular tcp to be able to switch to an alternate TCP stack in other than the closed state (pre-listen/connect). The idea is that if that is supported by the alternate stack, it is asked if it's OK to switch. If it approves the "handoff" then we allow the switch to happen. Also the fini() function now gets a flag to tell if you are switching away or the tcb is destroyed. The init() call into the alternate stack is moved to the end so the tcb is more fully formed before the init transpires. Sponsored by: Netflix Inc. Differential Revision: D6790
# bac5bedf	26-Apr-2016	Conrad Meyer <cem@FreeBSD.org>	tcp_usrreq: Free allocated buffer in relock case The disgusting macro INP_WLOCK_RECHECK may early-return. In tcp_default_ctloutput() the TCP_CCALGOOPT case allocates memory before invoking this macro, which may leak memory. Add a _CLEANUP variant that takes a code argument to perform variable cleanup in the early return path. Use it to free the 'pbuf' allocated in tcp_default_ctloutput(). I am not especially happy with this macro, but I reckon it's not any worse than INP_WLOCK_RECHECK already was. Reported by: Coverity CID: 1350286 Sponsored by: EMC / Isilon Storage Division
# bf840a17	14-Mar-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Redo r294869. The array of counters for TCP states doesn't belong to struct tcpstat, because the structure can be zeroed out by netstat(1) -z, and of course running connection counts shouldn't be touched. Place running connection counts into separate array, and provide separate read-only sysctl oid for it.
# e79cb051	03-Mar-2016	George V. Neville-Neil <gnn@FreeBSD.org>	Fix dtrace probes (introduced in 287759): debug__input was used for output and drop; connect didn't always fire a user probe some probes were missing in fastpath Submitted by: Hannes Mehnert Sponsored by: REMS, EPSRC Differential Revision: https://reviews.freebsd.org/D5525
# 4644fda3	27-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Rename netinet/tcp_cc.h to netinet/cc/cc.h. Discussed with: lstewart
# af6fef3a	27-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Fix issues with TCP_CONGESTION handling after r294540: o Bring back the buf[TCP_CA_NAME_MAX] for TCP_CONGESTION, for TCP_CCALGOOPT use dynamically allocated *pbuf. o For SOPT_SET TCP_CONGESTION NUL-terminate the string taken from userland. o For SOPT_SET TCP_CONGESTION do the search for the algorithm keeping the inpcb lock. o For SOPT_GET TCP_CONGESTION first strlcpy() the name holding the inpcb lock into temporary buffer, then copyout. Together with: lstewart
# 57a78e3b	26-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Augment struct tcpstat with tcps_states[], which is used for book-keeping the amount of TCP connections by state. Provides a cheap way to get connection count without traversing the whole pcb list. Sponsored by: Netflix
# d519cedb	21-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Provide new socket option TCP_CCALGOOPT, which stands for TCP congestion control algorithm options. The argument is variable length and is opaque to TCP, forwarded directly to the algorithm's ctl_output method. Provide new includes directory netinet/cc, where algorithm specific headers can be installed. The new API doesn't yet have any in tree consumers. The original code written by lstewart. Reviewed by: rrs, emax Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D711
# 73e263b1	21-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Refactor TCP_CONGESTION setsockopt handling: - Use M_TEMP instead of stack variable. - Unroll error handling, removing several levels of indentation.
# 2de3e790	21-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	- Rename cc.h to more meaningful tcp_cc.h. - Declare it a kernel only include, which it already is. - Don't include tcp.h implicitly from tcp_cc.h
# 0c39d38d	06-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Historically we have two fields in tcpcb to describe sender MSS: t_maxopd, and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned up after T/TCP removal. After all permutations over the years the result is that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if timestamps are in action, or is equal to t_maxopd otherwise. That's a very rough estimate of MSS reduced by options length. Throughout the code it was used in places where precision was not important, like cwnd or ssthresh calculations. With this change: - t_maxopd goes away. - t_maxseg now stores MSS not adjusted by options. - new function tcp_maxseg() is provided, that calculates MSS reduced by options length. The function gives a better estimate, since it takes into account SACK state as well. Reviewed by: jtl Differential Revision: https://reviews.freebsd.org/D3593
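As a rough illustration of the pre-change bookkeeping described in 0c39d38d, the sketch below models how t_maxseg was derived from t_maxopd. The function name old_t_maxseg and the macro SKETCH_TSTAMP_APPA_LEN are hypothetical; the value 12 is an assumption standing in for TCPOLEN_TSTAMP_APPA (a 10-byte timestamp option padded with two NOPs). It is not the kernel code.

```c
#include <stdbool.h>

/* Assumed stand-in for TCPOLEN_TSTAMP_APPA: 10-byte timestamp option + 2 NOPs. */
#define SKETCH_TSTAMP_APPA_LEN 12

/*
 * Hypothetical model of the old estimate: t_maxseg equals t_maxopd, reduced
 * by the padded timestamp option length only when timestamps are in use.
 */
unsigned int
old_t_maxseg(unsigned int t_maxopd, bool timestamps_in_use)
{
	return (timestamps_in_use ? t_maxopd - SKETCH_TSTAMP_APPA_LEN : t_maxopd);
}
```

With an assumed t_maxopd of 1460, this yields 1448 when timestamps are negotiated and 1460 otherwise. The estimate ignores any other options, which is the gap the commit says tcp_maxseg() narrows by also considering SACK state.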
# 281a0fd4	24-Dec-2015	Patrick Kelsey <pkelsey@FreeBSD.org>	Implementation of server-side TCP Fast Open (TFO) [RFC7413]. TFO is disabled by default in the kernel build. See the top comment in sys/netinet/tcp_fastopen.c for implementation particulars. Reviewed by: gnn, jch, stas MFC after: 3 days Sponsored by: Verisign, Inc. Differential Revision: https://reviews.freebsd.org/D4350
# 55bceb1e	15-Dec-2015	Randall Stewart <rrs@FreeBSD.org>	First cut of the modularization of our TCP stack. Still to do is to clean up the timer handling using the async-drain. Other optimizations may be coming to go with this. What's here will allow different TCP implementations (one included). Reviewed by: jtl, hiren, transports Sponsored by: Netflix Inc. Differential Revision: D4055
# 86a996e6	13-Oct-2015	Hiren Panchasara <hiren@FreeBSD.org>	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren
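The mbuf estimate quoted in 86a996e6 can be reproduced with a small calculation. The sketch below assumes, as the message does, at least one mbuf per saved packet; the function name tcppcap_min_mbufs is hypothetical and does not exist in the kernel.

```c
/*
 * Lower bound on mbufs held by the packet-capture feature, assuming one mbuf
 * per saved packet and a separate list for each direction of each socket.
 */
unsigned long
tcppcap_min_mbufs(unsigned long sockets, unsigned long pkts_per_direction)
{
	const unsigned long directions = 2;	/* inbound and outbound */

	return (sockets * directions * pkts_per_direction);
}
```

For 3,000 sockets saving five packets per direction, this gives the 30,000 mbufs mentioned in the commit message.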
# 550e9d42	15-Sep-2015	Hiren Panchasara <hiren@FreeBSD.org>	Remove unnecessary tcp state transition call. Differential Revision: D3451 Reviewed by: markj MFC after: 2 weeks Sponsored by: Limelight Networks
# 5d06879a	13-Sep-2015	George V. Neville-Neil <gnn@FreeBSD.org>	Add DTrace probe points, translators and a corresponding script to provide the TCPDEBUG functionality with pure DTrace. Reviewed by: rwatson MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: D3530
# 079672cb	08-Aug-2015	Julien Charbon <jch@FreeBSD.org>	Fix a kernel assertion issue introduced with r286227: Avoid too strict INP_INFO_RLOCK_ASSERT checks due to tcp_notify() being called from in6_pcbnotify(). Reported by: Larry Rosenman <ler@lerctr.org> Submitted by: markj, jch
# ff9b006d	02-Aug-2015	Julien Charbon <jch@FreeBSD.org>	Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability: - The existing TCP INP_INFO lock continues to protect the global inpcb list stability during full list traversal (e.g. tcp_pcblist()). - A new INP_LIST lock protects inpcb list actual modifications (inp allocation and free) and inpcb global counters. It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input()) and INP_INFO_WLOCK only in occasional operations that walk all connections. PR: 183659 Differential Revision: https://reviews.freebsd.org/D2599 Reviewed by: jhb, adrian Tested by: adrian, nitroboost-gmail.com Sponsored by: Verisign, Inc.
# 4741bfcb	29-Jul-2015	Patrick Kelsey <pkelsey@FreeBSD.org>	Revert r265338, r271089 and r271123 as those changes do not handle non-inline urgent data and introduce an mbuf exhaustion attack vector similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs. Address the issue described in FreeBSD-SA-15:15.tcp. Reviewed by: glebius Approved by: so Approved by: jmallett (mentor) Security: FreeBSD-SA-15:15.tcp Sponsored by: Norse Corp, Inc.
# eb96dc33	09-Mar-2015	Julien Charbon <jch@FreeBSD.org>	In TCP, connect() can return incorrect error code EINVAL instead of EADDRINUSE or ECONNREFUSED PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=196035 Differential Revision: https://reviews.freebsd.org/D1982 Reported by: Mark Nunberg <mnunberg@haskalah.org> Submitted by: Harrison Grundy <harrison.grundy@astrodoggroup.com> Reviewed by: adrian, jch, glebius, gnn Approved by: jhb MFC after: 2 weeks
# 2cbcd3c1	30-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Merge from projects/sendfile: - Provide pru_ready function for TCP. - Don't call tcp_output() from tcp_usr_send() if no ready data was put into the socket buffer. - In case of dropped connection don't try to m_freem() not ready data. Sponsored by: Nginx, Inc. Sponsored by: Netflix
# 651e4e6a	30-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Merge from projects/sendfile: extend protocols API to support sending not ready data: o Add new flag to pru_send() flags - PRUS_NOTREADY. o Add new protocol method pru_ready(). Sponsored by: Nginx, Inc. Sponsored by: Netflix
# 300fa232	29-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Missed in r274421: use sbavail() instead of bare access to sb_cc.
# cea40c48	30-Oct-2014	Julien Charbon <jch@FreeBSD.org>	Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and tcp_tw_2msl_scan(). This race condition drives unplanned timewait timeout cancellation. Also simplify implementation by holding inpcb reference and removing tcptw reference counting. Differential Revision: https://reviews.freebsd.org/D826 Submitted by: Marc De la Gueronniere <mdelagueronniere@verisign.com> Submitted by: jch Reviewed By: jhb (mentor), adrian, rwatson Sponsored by: Verisign, Inc. MFC after: 2 weeks X-MFC-With: r264321
# 489dcc92	12-Oct-2014	Julien Charbon <jch@FreeBSD.org>	A connection in TIME_WAIT state before calling close() actually did not receive any RST packet. Do not set error to ECONNRESET in this case. Differential Revision: https://reviews.freebsd.org/D879 Reviewed by: rpaulo, adrian Approved by: jhb (mentor) Sponsored by: Verisign, Inc.
# a7e201bb	10-Sep-2014	Andrey V. Elsukov <ae@FreeBSD.org>	Make in6_pcblookup_hash_locked and in6_pcbladdr static. Obtained from: Yandex LLC Sponsored by: Yandex LLC
# e407b67b	04-May-2014	Gleb Smirnoff <glebius@FreeBSD.org>	The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with mixing on stack memory and UMA memory in one linked list. Thus, rewrite TCP reassembly code in terms of memory usage. The algorithm remains unchanged. We actually do not need extra memory to build a reassembly queue. Arriving mbufs are always packet header mbufs. So we got the length of data as pkthdr.len. We got m_nextpkt for linkage. And we need only one pointer to point at the tcphdr, use PH_loc for that. In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen field now counts not packets, but bytes in the queue. This gives us more precision when comparing to socket buffer limits. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 6f3caa6d	28-Jan-2014	George V. Neville-Neil <gnn@FreeBSD.org>	Decrease lock contention within the TCP accept case by removing the INP_INFO lock from tcp_usr_accept. As the PR/patch states this was following the advice already in the code. See the PR below for a full discussion of this change and its measured effects. PR: 183659 Submitted by: Julien Charbon Reviewed by: jhb
# 2f3eb7f4	08-Nov-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Make TCP_KEEP* socket options readable. At least PostgreSQL wants to read the values. Reported by: sobomax
# 76039bc8	26-Oct-2013	Gleb Smirnoff <glebius@FreeBSD.org>	The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare for this event, adding if_var.h to files that do need it. Also, include all includes that now are included due to implicit pollution via if_var.h Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 57f60867	25-Aug-2013	Mark Johnston <markj@FreeBSD.org>	Implement the ip, tcp, and udp DTrace providers. The probe definitions use dynamic translation so that their arguments match the definitions for these providers in Solaris and illumos. Thus, existing scripts for these providers should work unmodified on FreeBSD. Tested by: gnn, hiren MFC after: 1 month
# adfaf8f6	25-Jan-2013	Navdeep Parhar <np@FreeBSD.org>	Add checks for SO_NO_OFFLOAD in a couple of places that I missed earlier in r245915.
# 460cf046	25-Jan-2013	Navdeep Parhar <np@FreeBSD.org>	There is no need to call into the TOE driver twice in pru_rcvd (tod_rcvd and then tod_output right after that). Reviewed by: bz@
# 37cc0ecb	25-Jan-2013	Navdeep Parhar <np@FreeBSD.org>	Heed SO_NO_OFFLOAD. MFC after: 1 week
# 85c05144	27-Sep-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Fix bug in TCP_KEEPCNT setting, which slipped in in the last round of reviewing of r231025. Unlike other options from this family TCP_KEEPCNT doesn't specify time interval, but a count, thus parameter supplied doesn't need to be multiplied by hz. Reported & tested by: amdmi3
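The distinction fixed in 85c05144 can be sketched as follows: the time-valued keepalive options are scaled by hz before being stored, while the count is stored as given. The enum, its constants, and the function name below are hypothetical illustrations, not identifiers from tcp_usrreq.c.

```c
/* Hypothetical labels for the four per-socket keepalive options. */
enum keep_opt { KEEP_INIT, KEEP_IDLE, KEEP_INTVL, KEEP_CNT };

/*
 * Convert a user-supplied option value to the value stored per connection.
 * Time intervals (given in seconds) are converted to ticks; the probe count
 * is a plain number and must not be scaled.
 */
unsigned int
keep_opt_to_stored(enum keep_opt opt, unsigned int value, unsigned int hz)
{
	switch (opt) {
	case KEEP_CNT:
		return (value);		/* a count, not a time interval */
	default:
		return (value * hz);	/* seconds to clock ticks */
	}
}
```

With an assumed hz of 1000, an idle time of 60 seconds is stored as 60000 ticks, whereas a count of 8 stays 8. Scaling the count as well would have turned it into 8000 probes.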
# 09fe6320	19-Jun-2012	Navdeep Parhar <np@FreeBSD.org>	- Updated TOE support in the kernel. - Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs. These are available as t3_tom and t4_tom modules that augment cxgb(4) and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as usual with or without these extra features. - iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the works and will follow soon. Build-tested with make universe. 30s overview ============ What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the capabilities of an interface: # ifconfig -m | grep TOE Enable/disable TCP offload on an interface (just like any other ifnet capability): # ifconfig cxgbe0 toe # ifconfig cxgbe0 -toe Which connections are offloaded? Look for toe4 and/or toe6 in the output of netstat and sockstat: # netstat -np tcp | grep toe # sockstat -46c | grep toe Reviewed by: bz, gnn Sponsored by: Chelsio communications. MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
# 9077f387	05-Feb-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT, that allow to control initial timeout, idle time, idle re-send interval and idle send count on a per-socket basis. Reviewed by: andre, bz, lstewart
# db3cee51	06-Jan-2012	Navdeep Parhar <np@FreeBSD.org>	Always release the inp lock before returning from tcp_detach. MFC after: 5 days
# 873789cb	16-Oct-2011	Andre Oppermann <andre@FreeBSD.org>	Move the tcp_sendspace and tcp_recvspace sysctl's from the middle of tcp_usrreq.c to the top of tcp_output.c and tcp_input.c respectively next to the socket buffer autosizing controls. MFC after: 1 week
# e233e2ac	16-Oct-2011	Andre Oppermann <andre@FreeBSD.org>	VNET virtualize tcp_sendspace/tcp_recvspace and change the type to INT. A long is not necessary as the TCP window is limited to 2**30. A larger initial window isn't useful. MFC after: 1 week
# c8360ae2	16-Oct-2011	Andre Oppermann <andre@FreeBSD.org>	Update the comment and description of tcp_sendspace and tcp_recvspace to better reflect their purpose. MFC after: 1 week
# b598155a	02-Jun-2011	Robert Watson <rwatson@FreeBSD.org>	Do not leak the pcbinfohash lock in the case where in6_pcbladdr() returns an error during TCP connect(2) on an IPv6 socket. Submitted by: bz Sponsored by: Juniper Networks, Inc.
# fa046d87	30-May-2011	Robert Watson <rwatson@FreeBSD.org>	Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. 
pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. 
It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
# b287c6c7	30-Apr-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Make the TCP code compile without INET. Sort #includes and add #ifdef INETs. Add some comments at #endifs given more nestedness. To make the compiler happy, some default initializations were added in accordance with the style on the files. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days
# d28b9e89	04-Feb-2011	John Baldwin <jhb@FreeBSD.org>	When turning off TCP_NOPUSH, only call tcp_output() to immediately flush any pending data if the connection is established. Submitted by: csjp Reviewed by: lstewart MFC after: 1 week
# 7f79e7e4	29-Jan-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Remove duplicate printing of TF_NOPUSH in db_print_tflags(). MFC after: 10 days
# 79e955ed	07-Jan-2011	John Baldwin <jhb@FreeBSD.org>	Trim extra spaces before tabs.
# f5d34df5	17-Nov-2010	George V. Neville-Neil <gnn@FreeBSD.org>	Add new, per connection, statistics for TCP, including: Retransmitted Packets Zero Window Advertisements Out of Order Receives These statistics are available via the -T argument to netstat(1). MFC after: 2 weeks
# dbc42409	11-Nov-2010	Lawrence Stewart <lstewart@FreeBSD.org>	This commit marks the first formal contribution of the "Five New TCP Congestion Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details about the project are available at: http://caia.swin.edu.au/freebsd/5cc/ - Add a KPI and supporting infrastructure to allow modular congestion control algorithms to be used in the net stack. Algorithms can maintain per-connection state if required, and connections maintain their own algorithm pointer, which allows different connections to concurrently use different algorithms. The TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to programmatically query or change the congestion control algorithm respectively from within an application at runtime. - Integrate the framework with the TCP stack in as least intrusive a manner as possible. Care was also taken to develop the framework in a way that should allow integration with other congestion aware transport protocols (e.g. SCTP) in the future. The hope is that we will one day be able to share a single set of congestion control algorithm modules between all congestion aware transport protocols. - Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack and use it to decouple the meaning of recovery from a congestion event and recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based congestion control protocols don't generally need to recover from packet loss and need a different way to note a congestion recovery episode within the stack. - Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code and ensures the stack always uses the appropriate mechanisms for recovering from packet loss during a congestion recovery episode. - Extract the NewReno congestion control algorithm from the TCP stack and massage it into module form. 
NewReno is always built into the kernel and will remain the default algorithm for the foreseeable future. Implementations of additional different algorithms will become available in the near future. - Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code that relies on the size of "struct tcpcb" is required. Many thanks go to the Cisco University Research Program Fund at Community Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work at the Centre for Advanced Internet Architectures, Swinburne University of Technology is greatly appreciated. In collaboration with: David Hayes <dahayes at swin edu au> and Grenville Armitage <garmitage at swin edu au> Sponsored by: Cisco URP, FreeBSD Foundation Reviewed by: rpaulo Tested by: David Hayes (and many others over the years) MFC after: 3 months
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# 1c18314d	16-Sep-2010	Andre Oppermann <andre@FreeBSD.org>	Remove the TCP inflight bandwidth limiter as announced in r211315 to give way for the pluggable congestion control framework. It is the task of the congestion control algorithm to set the congestion window and amount of inflight data without external interference. In 'struct tcpcb' the variables previously used by the inflight limiter are renamed to spares to keep the ABI intact and to have some more space for future extensions. In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to preserve the ABI. It is always set to 0. In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed to preserve the ABI. It is always set to 0. These unused variable in the various structures may be reused in the future or garbage collected before the next release or at some other point when an ABI change happens anyway for other reasons. No MFC is planned. The inflight bandwidth limiter stays disabled by default in the other branches but remains available.
# 03b868be	01-Jun-2010	Robert Watson <rwatson@FreeBSD.org>	Merge r204809 from head to stable/8: Add a comment to tcp_usr_accept() to indicate why it is we acquire the tcbinfo lock there: r175612, which re-added it, masked a race between sonewconn(2) and accept(2) that could allow an incompletely initialized address on a newly-created socket on a listen queue to be exposed. Full details can be found in that commit message. Sponsored by: Juniper Networks Approved by: re (bz)
# 8296cddf	06-Mar-2010	Robert Watson <rwatson@FreeBSD.org>	Add a comment to tcp_usr_accept() to indicate why it is we acquire the tcbinfo lock there: r175612, which re-added it, masked a race between sonewconn(2) and accept(2) that could allow an incompletely initialized address on a newly-created socket on a listen queue to be exposed. Full details can be found in that commit message. MFC after: 1 week Sponsored by: Juniper Networks
# e10b0dfd	05-Jan-2010	John Baldwin <jhb@FreeBSD.org>	MFC 200847: - Rename the __tcpi_(snd|rcv)_mss fields of the tcp_info structure to remove the leading underscores since they are now implemented. - Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info structure.
# 43d94734	22-Dec-2009	John Baldwin <jhb@FreeBSD.org>	- Rename the __tcpi_(snd|rcv)_mss fields of the tcp_info structure to remove the leading underscores since they are now implemented. - Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info structure. Reviewed by: rwatson MFC after: 2 weeks
# 11c99a6d	15-Sep-2009	Andre Oppermann <andre@FreeBSD.org>	-Put the optimized soreceive_stream() under a compile time option called TCP_SORECEIVE_STREAM for the time being. Requested by: brooks Once compiled in make it easily switchable for testers by using a tuneable net.inet.tcp.soreceive_stream and a corresponding read-only sysctl to report the current state. Suggested by: rwatson MFC after: 2 days
# 530c0060	01-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# eddfbb76	14-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing them in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING. 
Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
# 88d166bf	23-Jun-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Make callers to in6_selectsrc() and in6_pcbladdr() pass in memory to save the selected source address rather than returning an unreferenced copy to a pointer that might long be gone by the time we use the pointer for anything meaningful. Asked for by: rwatson Reviewed by: rwatson
# ef760e6a	22-Jun-2009	Andre Oppermann <andre@FreeBSD.org>	Add soreceive_stream(), an optimized version of soreceive() for stream (TCP) sockets. It is functionally identical to generic soreceive() but has a number of stream-specific optimizations: o does only one sockbuf unlock/lock per receive independent of the length of data to be moved into the uio compared to soreceive() which unlocks/locks per mbuf. o uses m_mbuftouio() instead of its own copy(out) variant. o much more compact code flow as a large number of special cases is removed. o much improved readability. It offers significantly reduced CPU usage and lock contention when receiving fast TCP streams. Additional gains are obtained when the receiving application is using SO_RCVLOWAT to batch up some data before a read (and wakeup) is done. This function was written by "reverse engineering" and is not just a stripped down variant of soreceive(). It is not yet enabled by default on TCP sockets. Instead it is commented out in the protocol initialization in tcp_usrreq.c until more widespread testing has been done. Testers, especially with 10GigE gear, are welcome. MFP4: r164817 //depot/user/andre/soreceive_stream/
# 9f78a87a	16-Jun-2009	John Baldwin <jhb@FreeBSD.org>	- Change members of tcpcb that cache values of ticks from int to u_int: t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age, t_badrxtwin. - Change t_recent in struct timewait from u_long to u_int32_t to match the type of the field it shadows from tcpcb: ts_recent. - Change t_starttime in struct timewait from u_long to u_int to match the t_starttime field in tcpcb. Requested by: bde (1, 3)
# a13c655c	11-Jun-2009	John Baldwin <jhb@FreeBSD.org>	Correct printf format type mismatches.
# 0e8cc7e7	10-Jun-2009	John Baldwin <jhb@FreeBSD.org>	Change a few members of tcpcb that store cached copies of ticks to be ints instead of unsigned longs. This fixes a few overflow edge cases on 64-bit platforms. Specifically, if an idle connection receives a packet shortly before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep alive timer fires after 2^31 clock ticks, the keep alive timer will think that the connection has been idle for a very long time and will immediately drop the connection instead of sending a keep alive probe. Reviewed by: silby, gnn, lstewart MFC after: 1 week
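
The overflow described in 0e8cc7e7 can be modelled with a little integer arithmetic. The following is a rough sketch, not kernel code: it assumes a 32-bit signed `ticks` counter, a 64-bit `unsigned long`, and hz=1000 as in the commit message. The helper names (`to_int32`, `as_u64`, `idle_with_ulong`, `idle_with_int`) are hypothetical and exist only for this illustration.

```python
HZ = 1000            # tick rate assumed in the commit message
U64 = (1 << 64) - 1  # mask modelling a 64-bit unsigned long


def to_int32(x):
    """Interpret the low 32 bits of x as a signed 32-bit int (models 'ticks')."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def as_u64(x):
    """Convert a (possibly negative) value to a 64-bit unsigned long."""
    return x & U64


def idle_with_ulong(raw_now, raw_last):
    """Idle time when the cached tick value is stored as unsigned long.

    The signed 32-bit 'ticks' is sign-extended before the subtraction,
    so a wrapped (negative) value becomes a huge unsigned number.
    """
    return as_u64(to_int32(raw_now) - as_u64(to_int32(raw_last)))


def idle_with_int(raw_now, raw_last):
    """Idle time when the cached tick value is a 32-bit int: wraps cleanly."""
    return to_int32(to_int32(raw_now) - to_int32(raw_last))


# Days of uptime before a 32-bit signed tick counter reaches 2^31.
days_to_wrap = (1 << 31) / HZ / 86400

# A packet arrives 16 ticks before the wrap; the timer fires 10 ticks after it.
last_rx = (1 << 31) - 16
timer_fires = (1 << 31) + 10
```

In this model `idle_with_int(timer_fires, last_rx)` yields the true 26 ticks, while `idle_with_ulong(timer_fires, last_rx)` yields a value close to 2^64, which is why the keep alive timer concluded the connection had been idle for a very long time.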
# 78b50714	11-Apr-2009	Robert Watson <rwatson@FreeBSD.org>	Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and TCPSTAT_INC(), rather than directly manipulating the fields across the kernel. This will make it easier to change the implementation of these statistics, such as using per-CPU versions of the data structures. MFC after: 3 days
# 970caf60	07-Apr-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	With the right comparison we get a proper wscale value and thus more adequate TCP performance with IPv6. Changes for IPv4, r166403 and r172795, both ignored the IPv6 counterpart and left it in the state of art of year 2000. The same logic in syncache already shares code between v4 and v6 so things do not need to be adapted there. Reported by: Steinar Haug (sthaug nethelp.no) Tested by: Steinar Haug (sthaug nethelp.no) MFC after: 3 days
# ad71fe3c	15-Mar-2009	Robert Watson <rwatson@FreeBSD.org>	Correct a number of evolved problems with inp_vflag and inp_flags: certain flags that should have been in inp_flags ended up in inp_vflag, meaning that they were inconsistently locked, and in one case, interpreted. Move the following flags from inp_vflag to gaps in the inp_flags space (and clean up the inp_flags constants to make gaps more obvious to future takers): INP_TIMEWAIT INP_SOCKREF INP_ONESBCAST INP_DROPPED Some aspects of this change have no effect on kernel ABI at all, as these are UDP/TCP/IP-internal uses; however, netstat and sockstat detect INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this into account. MFC after: 1 week (or after dependencies are MFC'd) Reviewed by: bz
# ce2ae9ab	24-Feb-2009	Robert Watson <rwatson@FreeBSD.org>	In tcp_usr_shutdown() and tcp_usr_send(), I missed converting NULL checks for the tcpcb, previously used to detect complete disconnection, with INP_DROPPED checks. Correct that, preventing shutdown() from improperly generating a TCP segment with destination IP and port of 0.0.0.0:0. PR: kern/132050 Reported by: david gueluy <david.gueluy at netasq.com> MFC after: 3 weeks
# b89e82dd	05-Feb-2009	Jamie Gritton <jamie@FreeBSD.org>	Standardize the various prison_foo_ip[46] functions and prison_if to return zero on success and an error code otherwise. The possible errors are EADDRNOTAVAIL if an address being checked for doesn't match the prison, and EAFNOSUPPORT if the prison doesn't have any addresses in that address family. For most callers of these functions, use the returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or EINVAL. Always include a jailed() check in these functions, where a non-jailed cred always returns success (and makes no changes). Remove the explicit jailed() checks that preceded many of the function calls. Approved by: bz (mentor)
# dcdb4371	16-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Use inc_flags instead of the inc_isipv6 alias which so far had been the only flag with random usage patterns. Switch inc_flags to be used as a real bit field by using INC_ISIPV6 with bitops to check for the 'isipv6' condition. While here fix a place or two where in case of v4 inc_flags were not properly initialized before.[1] Found by: rwatson during review [1] Discussed with: rwatson Reviewed by: rwatson MFC after: 4 weeks
# fc384fa5	15-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Another step assimilating IPv[46] PCB code - directly use the inpcb names rather than the following IPv6 compat macros: in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag, in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and sotoin6pcb(). Apart from removing duplicate code in netipsec, this is a pure whitespace, not a functional change. Discussed with: rwatson Reviewed by: rwatson (version before review requested changes) MFC after: 4 weeks (set the timer and see then)
# 4b79449e	02-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Rather than using hidden includes (with circular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
# 413628a7	29-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addition to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the aforementioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. 
- My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
# 5cd54324	27-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Replace most INP_CHECK_SOCKAF() uses checking if it is an IPv6 socket by comparing a constant inp vflag. This is expected to help to reduce extra locking. Suggested by: rwatson Reviewed by: rwatson MFC after: 6 weeks
# 6aee2fc5	26-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Merge in6_pcbfree() into in_pcbfree() which after the previous IPsec change in r185366 only differed in two additional IPv6 lines. Rather than splattering conditional code everywhere add the v6 check centrally at this single place. Reviewed by: rwatson (as part of a larger changeset) MFC after: 6 weeks (*) (*) possibly need to leave a stub wrapper in 7 to keep the symbol.
# 0206cdb8	26-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Remove in6_pcbdetach() as it is exactly the same function as in_pcbdetach() and we don't need the code twice. Reviewed by: rwatson MFC after: 6 weeks (*) (*) possibly need to leave a stub wrapper in 7 to keep the symbol.
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 8b615593	02-Oct-2008	Marko Zec <zec@FreeBSD.org>	Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparison between pre- and post-change object files(*). (*) netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 603724d3	17-Aug-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
# f2512ba1	31-Jul-2008	Rui Paulo <rpaulo@FreeBSD.org>	MFp4 (//depot/projects/tcpecn/): TCP ECN support. Merge of my GSoC 2006 work for NetBSD. TCP ECN is defined in RFC 3168. Partly reviewed by: dwmalone, silby Obtained from: NetBSD
# 8ab7ce7c	05-May-2008	Kip Macy <kmacy@FreeBSD.org>	replace spaces added in last change with tabs
# 535fbad6	05-May-2008	Kip Macy <kmacy@FreeBSD.org>	add rcv_nxt, snd_nxt, and toe offload id to FreeBSD-specific extension fields for tcp_info
# 8501a69c	17-Apr-2008	Robert Watson <rwatson@FreeBSD.org>	Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to explicitly select write locking for all use of the inpcb mutex. Update some pcbinfo lock assertions to assert locked rather than write-locked, although in practice almost all uses of the pcbinfo rwlock remain exclusive, and all instances of inpcb lock acquisition are exclusive. This change should introduce (ideally) little functional change. However, it lays the groundwork for significantly increased parallelism in the TCP/IP code. MFC after: 3 months Tested by: kris (superset of committed patch)
# 109058b0	23-Jan-2008	Robert Watson <rwatson@FreeBSD.org>	tcp_usrreq.c:1.313 removed tcbinfo locking from tcp_usr_accept(), which while in principle a good idea, opened us up to a race inherent to the syncache's direct insertion of incoming TCP connections into the "completed connection" listen queue, as it transpires that the socket is inserted before the inpcb is fully filled in by syncache_expand(). The bug manifested with the occasional returning of 0.0.0.0:0 in the address returned by the accept() system call, which occurred if accept managed to execute tcp_usr_accept() before syncache_expand() had copied the endpoint addresses into inpcb connection state. Re-add tcbinfo locking around the address copyout, which has the effect of delaying the copy until syncache_expand() has finished running, as it is run while the tcbinfo lock is held. This is undesirable in that it increases contention on tcbinfo further, but a more significant change will be required to how the syncache inserts new sockets in order to fix this and keep more granular locking here. In particular, either more state needs to be passed into sonewconn() so that pru_attach() can fill in the fields before the socket is inserted, or the socket needs to be inserted in the incomplete connection queue until it is actually ready to be used. Reported by: glebius (and kris) Tested by: glebius
# 1e8f5ffa	17-Jan-2008	Robert Watson <rwatson@FreeBSD.org>	In tcp_ctloutput(), don't hold the inpcb lock over sooptcopyin(), rather, drop the lock and then re-acquire it, revalidating TCP connection state assumptions when we do so. This avoids a potential lock order reversal (and potential deadlock, although none have been reported) due to the inpcb lock being held over a page fault. MFC after: 1 week PR: 102752 Reviewed by: bz Reported by: Václav Haisman <v dot haisman at sh dot cvut dot cz>
# bc65987a	18-Dec-2007	Kip Macy <kmacy@FreeBSD.org>	Incorporate TCP offload hooks in to core TCP code. - Rename output routines tcp_gen_* -> tcp_output_*. - Rename notification routines that turn in to no-ops in the absence of TOE from tcp_gen_* -> tcp_offload_*. - Fix some minor comment nits. - Add a /* FALLTHROUGH */ Reviewed by: Sam Leffler, Robert Watson, and Mike Silbersack
# 9b3bc6bf	19-Oct-2007	Mike Silbersack <silby@FreeBSD.org>	Pick the smallest possible TCP window scaling factor that will still allow us to scale up to sb_max, aka kern.ipc.maxsockbuf. We do this because there are broken firewalls that will corrupt the window scale option, leading to the other endpoint believing that our advertised window is unscaled. At scale factors larger than 5 the unscaled window will drop below 1500 bytes, leading to serious problems when traversing these broken firewalls. With the default maxsockbuf of 256K, a scale factor of 3 will be chosen by this algorithm. Those who choose a larger maxsockbuf should watch out for the compatibility problems mentioned above. Reviewed by: andre
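
The selection rule in 9b3bc6bf, the smallest shift that still lets the 16-bit window field cover sb_max, can be sketched as below. This is an illustrative model rather than the committed code. The function name `pick_window_scale` is hypothetical, and the two constants are defined locally here, assuming a 16-bit window field (65535) and the maximum shift of 14 that the TCP window scale option permits.

```python
TCP_MAXWIN = 65535     # largest window expressible in the unscaled 16-bit field
TCP_MAX_WINSHIFT = 14  # largest shift the window scale option permits


def pick_window_scale(sb_max):
    """Return the smallest shift such that (TCP_MAXWIN << shift) >= sb_max.

    The result is capped at TCP_MAX_WINSHIFT when sb_max is too large
    to be covered by any permitted shift.
    """
    scale = 0
    while scale < TCP_MAX_WINSHIFT and (TCP_MAXWIN << scale) < sb_max:
        scale += 1
    return scale


# Default maxsockbuf quoted in the commit message: 256K.
default_scale = pick_window_scale(256 * 1024)
```

With a 256K limit, a shift of 2 covers only 262140 bytes, just short of 262144, so the loop settles on 3, matching the figure in the commit message.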
# 4b421e2d	07-Oct-2007	Mike Silbersack <silby@FreeBSD.org>	Add FBSDID to all files in netinet so that people can more easily include file version information in bug reports. Approved by: re (kensmith)
# e2f2059f	23-Sep-2007	Mike Silbersack <silby@FreeBSD.org>	Two changes: - Reintegrate the ANSI C function declaration change from tcp_timer.c rev 1.92 - Reorganize the tcpcb structure so that it has a single pointer to the "tcp_timer" structure which contains all of the tcp timer callouts. This change means that when the single tcp timer change is reintegrated, tcpcb will not change in size, and therefore the ABI between netstat and the kernel will not change. Neither of these changes should have any functional impact. Reviewed by: bmah, rrs Approved by: re (bmah)
# 85d94372	07-Sep-2007	Robert Watson <rwatson@FreeBSD.org>	Back out tcp_timer.c:1.93 and associated changes that reimplemented the many TCP timers as a single timer, but retain the API changes necessary to reintroduce this change. This will back out the source of at least two reported problems: lock leaks in certain timer edge cases, and TCP timers continuing to fire after a connection has closed (a bug previously fixed and then reintroduced with the timer rewrite). In a follow-up commit, some minor restylings and comment changes performed after the TCP timer rewrite will be reapplied, and a further change to allow the TCP timer rewrite to be added back without disturbing the ABI. The new design is believed to be a good thing, but the outstanding issues are leading to significant stability/correctness problems that are holding up 7.0. This patch was generated by silby, but is being committed by proxy due to poor network connectivity for silby this week. Approved by: re (kensmith) Submitted by: silby Tested by: rwatson, kris Problems reported by: peter, kris, others
# 218cbbea	30-Jul-2007	Dag-Erling Smørgrav <des@FreeBSD.org>	Make tcpstates[] static, and make sure TCPSTATES is defined before <netinet/tcp_fsm.h> is included into any compilation unit that needs tcpstates[]. Also remove incorrect extern declarations and TCPDEBUG conditionals. This allows kernels both with and without TCPDEBUG to build, and unbreaks the tinderbox. Approved by: re (rwatson)
# 24face54	28-Jul-2007	Matt Jacob <mjacob@FreeBSD.org>	Fix compilation problems- tcpstates is only available if TCPDEBUG is set. Approved by: re (in spirit)
# 3c010a41	15-Jun-2007	Matt Jacob <mjacob@FreeBSD.org>	Garbage collect some debug code that not only no longer could work but in fact probably causes random pointer dereferences. Garbage collect the tp variable too.
# abc7d910	30-May-2007	Robert Watson <rwatson@FreeBSD.org>	(1) In tcp_usrclosed(), tp can never become NULL, so don't test for NULL before handling the socket disconnection case. (2) Clean up surrounding comments and formatting. Found with: Coverity Prevent(tm) (1) CID: 2203
# 54d642bb	11-May-2007	Robert Watson <rwatson@FreeBSD.org>	Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr protocol entry points using functions named proto_getsockaddr and proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr. While it's true that sockaddrs are allocated and set, the net effect is to retrieve (get) the socket address or peer address from a socket, not set it, so align names to that intent.
# 169db7b2	11-May-2007	Robert Watson <rwatson@FreeBSD.org>	Remove unneeded wrappers for in_setsockaddr() and in_setpeeraddr(), which used to exist so pcbinfo locks could be acquired, but are no longer required as a result of socket/pcb reference model refinements.
# f2565d68	10-May-2007	Robert Watson <rwatson@FreeBSD.org>	Move universally to ANSI C function declarations, with relatively consistent style(9)-ish layout.
# 1a553740	06-May-2007	Andre Oppermann <andre@FreeBSD.org>	Remove unused requested_s_scale from struct tcpcb.
# 3529149e	06-May-2007	Andre Oppermann <andre@FreeBSD.org>	Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of a dedicated sack_enable int for this bool. Change all users accordingly.
# 84ca8aa6	01-May-2007	Robert Watson <rwatson@FreeBSD.org>	Remove unused pcbinfo arguments to in_setsockaddr() and in_setpeeraddr().
# b8152ba7	11-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Change the TCP timer system from using the callout system five times directly to a merged model where only one callout, the next to fire, is registered. Instead of callout_reset(9) and callout_stop(9) the new function tcp_timer_activate() is used which then internally manages the callout. The single new callout is a mutex callout on inpcb simplifying the locking a bit. tcp_timer() is the called function which handles all race conditions in one place and then dispatches the individual timer functions. Reviewed by: rwatson (earlier version)
# ad3f9ab3	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	ANSIfy function declarations and remove register keywords for variables. Consistently apply style to all function declarations.
# e406f5a1	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Remove tcp_minmssoverload DoS detection logic. The problem it tried to protect us from wasn't really there and it only bloats the code. Should the problem surface in the future we can simply resurrect it from cvs history.
# 7c72af87	26-Feb-2007	Mohan Srinivasan <mohans@FreeBSD.org>	Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigates potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default. Reviewed by: gnn, silby.
# 497057ee	17-Feb-2007	Robert Watson <rwatson@FreeBSD.org>	Add "show inpcb", "show tcpcb" DDB commands, which should come in handy for debugging sblock and other network panics.
# 1baaf834	02-Feb-2007	Bruce M Simpson <bms@FreeBSD.org>	Expose smoothed RTT and RTT variance measurements to userland via socket option TCP_INFO. Note that the units used in the original Linux API are in microseconds, so use a 64-bit mantissa to convert FreeBSD's internal measurements from struct tcpcb from ticks.
# 6741ecf5	01-Feb-2007	Andre Oppermann <andre@FreeBSD.org>	Auto sizing TCP socket buffers. Normally the socket buffers are static (either derived from global defaults or set with setsockopt) and do not adapt to real network conditions. Two things happen: a) your socket buffers are too small and you can't reach the full potential of the network between both hosts; b) your socket buffers are too big and you waste a lot of kernel memory for data just sitting around. With automatic TCP send and receive socket buffers we can start with a small buffer and quickly grow it in parallel with the TCP congestion window to match real network conditions. FreeBSD has a default 32K send socket buffer. This supports a maximal transfer rate of only slightly more than 2Mbit/s on a 100ms RTT trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send buffer auto scaling and the default values below it supports 20Mbit/s at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or 1000%. For the receive side it looks slightly better with a default of 64K buffer size. New sysctls are: net.inet.tcp.sendbuf_auto=1 (enabled) net.inet.tcp.sendbuf_inc=8192 (8K, step size) net.inet.tcp.sendbuf_max=262144 (256K, growth limit) net.inet.tcp.recvbuf_auto=1 (enabled) net.inet.tcp.recvbuf_inc=16384 (16K, step size) net.inet.tcp.recvbuf_max=262144 (256K, growth limit) Tested by: many (on HEAD and RELENG_6) Approved by: re MFC after: 1 month
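
The throughput figures quoted in 6741ecf5 follow from the rule that a sender can have at most one socket buffer of data in flight per round trip. The following back-of-the-envelope sketch reproduces them. The function name `max_rate_mbit` is hypothetical, and the model ignores protocol overhead and loss.

```python
def max_rate_mbit(buf_bytes, rtt_seconds):
    """Upper bound on throughput in Mbit/s: one full buffer per round trip."""
    return buf_bytes * 8 / rtt_seconds / 1e6


# Default 32K send buffer versus the 256K auto-sizing growth limit.
static_100ms = max_rate_mbit(32 * 1024, 0.100)   # roughly 2.6 Mbit/s
static_200ms = max_rate_mbit(32 * 1024, 0.200)   # roughly 1.3 Mbit/s
auto_100ms = max_rate_mbit(256 * 1024, 0.100)    # roughly 21 Mbit/s
auto_200ms = max_rate_mbit(256 * 1024, 0.200)    # roughly 10.5 Mbit/s
```

The ratio between the 256K and 32K cases is 8, which the commit message rounds to "factor 10".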
# 087b55ea	01-Feb-2007	Andre Oppermann <andre@FreeBSD.org>	Change the way the advertized TCP window scaling is computed. Instead of upper-bounding it to the size of the initial socket buffer lower-bound it to the smallest MSS we accept. Ideally we'd use the actual MSS information here but it is not available yet. For socket buffer auto sizing to be effective we need room to grow the receive window. The window scale shift is determined at connection setup and can't be changed afterwards. The previous, original, method effectively just did a power of two roundup of the socket buffer size at connection setup severely limiting the headroom for larger socket buffers. Tested by: many (as part of the socket buffer auto sizing patch) MFC after: 1 month
# 21367f63	22-Nov-2006	Sam Leffler <sam@FreeBSD.org>	Change error codes returned by protocol operations when an inpcb is marked INP_DROPPED or INP_TIMEWAIT: o return ECONNRESET instead of EINVAL for close, disconnect, shutdown, rcvd, rcvoob, and send operations o return ECONNABORTED instead of EINVAL for accept These changes should reduce confusion in applications since EINVAL is normally interpreted to mean an invalid file descriptor. This change does not conflict with POSIX or other standards I checked. The return of EINVAL has always been possible but rare; it's become more common with recent changes to the socket/inpcb handling and with finer-grained locking and preemption. Note: there are other instances of EINVAL for this state that were left unchanged; they should be reviewed. Reviewed by: rwatson, andre, ru MFC after: 1 month
# 7ff0b850	17-Sep-2006	Andre Oppermann <andre@FreeBSD.org>	Make tcp_usr_send() free the passed mbufs on error in all cases as the comment to it claims. Sponsored by: TCP/IP Optimization Fundraise 2005
# a152f8a3	21-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	Change semantics of socket close and detach. Add a new protocol switch function, pru_close, to notify protocols that the file descriptor or other consumer of a socket is closing the socket. pru_abort is now a notification of close also, and no longer detaches. pru_detach is no longer used to notify of close, and will be called during socket tear-down by sofree() when all references to a socket evaporate after an earlier call to abort or close the socket. This means detach is now an unconditional teardown of a socket, whereas previously sockets could persist after detach if the protocol retained a reference. This facilitates sharing mutexes between layers of the network stack as the mutex is required during the checking and removal of references at the head of sofree(). With this change, pru_detach can now assume that the mutex will no longer be required by the socket layer after completion, whereas before this was not necessarily true. Reviewed by: gnn
# d915b280	18-Jul-2006	Stephan Uphoff <ups@FreeBSD.org>	Fix race conditions on enumerating pcb lists by moving the initialization (and where appropriate the destruction) of the pcb mutex to the init/finit functions of the pcb zones. This allows locking of the pcb entries and race condition free comparison of the generation count. Rearrange locking a bit to avoid extra locking operation to update the generation count in in_pcballoc(). (in_pcballoc now returns the pcb locked) I am planning to convert pcb list handling from a type safe to a reference count model soon. (As this allows really freeing the PCBs) Reviewed by: rwatson@, mohans@ MFC after: 1 week
# b4470c16	26-Jun-2006	Robert Watson <rwatson@FreeBSD.org>	In tcp6_usr_attach(), return immediately if SS_ISDISCONNECTED, to avoid dereferencing an uninitialized inp variable. Submitted by: Michiel Boland <michiel at boland dot org> MFC after: 1 month
# f2de87fe	04-Jun-2006	Robert Watson <rwatson@FreeBSD.org>	Push acquisition of pcbinfo lock out of tcp_usr_attach() into tcp_attach() after the call to soreserve(), as it doesn't require the global lock. Rearrange inpcb locking here also. MFC after: 1 month
# c78cbc7b	24-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Instead of calling tcp_usr_detach() from tcp_usr_abort(), break out common pcb tear-down logic into tcp_detach(), which is called from either. Invoke tcp_drop() from the tcp_usr_abort() path rather than tcp_disconnect(), as we want to drop it immediately not perform a FIN sequence. This is one reason why some people were experiencing panics in sodealloc(), as the netisr and aborting thread were simultaneously trying to tear down the socket. This bug could often be reproduced using repeated runs of the listenclose regression test. MFC after: 3 months PR: 96090 Reported by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris Tested by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris
# e6e65783	02-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Clarify comment on handling of non-timewait TCP states in tcp_usr_detach(). MFC after: 3 months
# 3d2d3ef4	03-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	After checking for SO_ISDISCONNECTED in tcp_usr_accept(), return immediately rather than jumping to the normal output handling, which assumes we've pulled out the inpcb, which hasn't happened at this point (and isn't necessary). Return ECONNABORTED instead of EINVAL when the inpcb has entered INP_TIMEWAIT or INP_DROPPED, as this is the documented error value. This may correct the panic seen by Ganbold. MFC after: 1 month Reported by: Ganbold <ganbold at micom dot mng dot net>
# 953b5606	02-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	During reformulation of tcp_usr_detach(), the call to initiate TCP disconnect for fully connected sockets was dropped, meaning that if the socket was closed while the connection was alive, it would be leaked. Structure tcp_usr_detach() so that there are two clear parts: initiating disconnect, and reclaiming state, and reintroduce the tcp_disconnect() call in the first part. MFC after: 3 months
# 34af7bae	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Properly handle an edge case previously not handled correctly: a socket can have a tcp connection that has entered time wait attached to it, in the event that shutdown() is called on the socket and the FINs properly exchange before close(). In this case we don't detach or free the inpcb, just leave the tcptw detached and freed, but we must release the inpcb lock (which we didn't previously). MFC after: 3 months
# 623dce13	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Update TCP for infrastructural changes to the socket/pcb refcount model, pru_abort(), pru_detach(), and in_pcbdetach(): - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code. - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, the receive code no longer requires the pcbinfo lock, and the send code only requires it if building a new connection on an otherwise unconnected socket triggered via sendto() with an address. This should significnatly reduce tcbinfo lock contention in the receive and send cases. - In order to support the invariant that so_pcb != NULL, it is now necessary for the TCP code to not discard the tcpcb any time a connection is dropped, but instead leave the tcpcb until the socket is shutdown. This case is handled by setting INP_DROPPED, to substitute for using a NULL so_pcb to indicate that the connection has been dropped. This requires the inpcb lock, but not the pcbinfo lock. - Unlike all other protocols in the tree, TCP may need to retain access to the socket after the file descriptor has been closed. Set SS_PROTOREF in tcp_detach() in order to prevent the socket from being freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether or not it needs to free the socket when the connection finally does close. The typical case where this occurs is if close() is called on a TCP socket before all sent data in the send socket buffer has been transmitted or acknowledged. If INP_SOCKREF is found when the connection is dropped, we release the inpcb, tcpcb, and socket instead of flagging INP_DROPPED. - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this. 
- Annotate the existence of a long-standing race in the TCP timer code, in which timers are stopped but not drained when the socket is freed, as waiting for drain may lead to deadlocks, or have to occur in a context where waiting is not permitted. This race has been handled by testing to see if the tcpcb pointer in the inpcb is NULL (and vice versa), which is not normally permitted, but may be true of a inpcb and tcpcb have been freed. Add a counter to test how often this race has actually occurred, and a large comment for each instance where we compare potentially freed memory with NULL. This will have to be fixed in the near future, but requires is to further address how to handle the timer shutdown shutdown issue. - Several TCP calls no longer potentially free the passed inpcb/tcpcb, so no longer need to return a pointer to indicate whether the argument passed in is still valid. - Un-macroize debugging and locking setup for various protocol switch methods for TCP, as it lead to more obscurity, and as locking becomes more customized to the methods, offers less benefit. - Assert copyright on tcp_usrreq.c due to significant modifications that have been made as part of this work. These changes significantly modify the memory management and connection logic of our TCP implementation, and are (as such) High Risk Changes, and likely to contain serious bugs. Please report problems to the current@ mailing list ASAP, ideally with simple test cases, and optionally, packet traces. MFC after: 3 months
# bc725eaf	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Change protocol switch method pru_detach() so that it returns void rather than an error. Detaches do not "fail", they either occur or the protocol flags SS_PROTOREF to take ownership of the socket. soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals. Protocol detach routines no longer try to free the socket on detach; this is performed in the socket code if the protocol permits it. In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach. netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic. MFC after: 3 months
# ac45e92f	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Change protocol switch pru_abort() API so that it returns void rather than an int, as an error here is not meaningful. Modify soabort() to unconditionally free the socket on the return of pru_abort(), and modify most protocols to no longer conditionally free the socket, since the caller will do this. This commit likely leaves parts of netinet and netinet6 in a situation where they may panic or leak memory, as they are not fully updated by this commit. This will be corrected shortly in followup commits to these components. MFC after: 3 months
# e59898ff	14-Dec-2005	Maxime Henrion <mux@FreeBSD.org>	Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to match the type of the variable they are exporting. Spotted by: Thomas Hurst <tom@hur.st> MFC after: 3 days
# d374e81e	30-Oct-2005	Robert Watson <rwatson@FreeBSD.org>	Push the assignment of a new or updated so_qlimit from solisten() following the protocol pru_listen() call to solisten_proto(), so that it occurs under the socket lock acquisition that also sets SO_ACCEPTCONN. This requires passing the new backlog parameter to the protocol, which also allows the protocol to be aware of changes in queue limit should it wish to do something about the new queue limit. This continues a move towards the socket layer acting as a library for the protocol. Bump __FreeBSD_version due to a change in the in-kernel protocol interface. This change has been tested with IPv4 and UNIX domain sockets, but not other protocols.
# ef8fd904	23-Aug-2005	Andre Oppermann <andre@FreeBSD.org>	Remove unnecessary IPSEC includes. MFC after: 2 weeks Sponsored by: TCP/IP Optimization Fundraise 2005
# a1f7e5f8	24-Jul-2005	Hajimu UMEMOTO <ume@FreeBSD.org>	scope cleanup. with this change - most of the kernel code will not care about the actual encoding of scope zone IDs and won't touch "s6_addr16[1]" directly. - similarly, most of the kernel code will not care about link-local scoped addresses as a special case. - scope boundary check will be stricter. For example, the current BSD code allows a packet with src=::1 and dst=(some global IPv6 address) to be sent outside of the node, if the application does: s = socket(AF_INET6); bind(s, "::1"); sendto(s, some_global_IPv6_addr); This is clearly wrong, since ::1 is only meaningful within a single node, but the current implementation of the BSD kernel cannot reject this attempt. Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp> Obtained from: KAME
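The stricter scope boundary check in the entry above can be illustrated in userland. This is only a minimal sketch of the loopback rule from the commit's example, not the KAME kernel code; the helper names `scope_allows_send()` and `scope_allows_send_str()` are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: refuse to pair a node-local source (::1) with a
 * destination that is not itself the loopback address.
 */
static bool
scope_allows_send(const struct in6_addr *src, const struct in6_addr *dst)
{
	assert(src != NULL && dst != NULL);
	if (IN6_IS_ADDR_LOOPBACK(src) && !IN6_IS_ADDR_LOOPBACK(dst))
		return (false);
	return (true);
}

/*
 * Convenience wrapper taking textual addresses.
 * Returns 1 if allowed, 0 if rejected, -1 if an address fails to parse.
 */
static int
scope_allows_send_str(const char *src, const char *dst)
{
	struct in6_addr s, d;

	if (inet_pton(AF_INET6, src, &s) != 1 ||
	    inet_pton(AF_INET6, dst, &d) != 1)
		return (-1);
	return (scope_allows_send(&s, &d) ? 1 : 0);
}
```

With this rule, the bind("::1") followed by sendto(global address) sequence from the commit message would be rejected, while loopback-to-loopback traffic is still allowed.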
# 30393994	31-May-2005	Robert Watson <rwatson@FreeBSD.org>	When aborting tcp_attach() due to a problem allocating or attaching the tcpcb, lock the inpcb before calling in_pcbdetach() or in6_pcbdetach(), as they expect the inpcb to be passed locked. MFC after: 7 days
# e6e0b5ff	31-May-2005	Robert Watson <rwatson@FreeBSD.org>	Assert tcbinfo lock, inpcb lock in tcp_disconnect(). Assert tcbinfo lock, inpcb lock in tcp_usrclosed(). MFC after: 7 days
# 7609aad7	01-Jun-2005	Robert Watson <rwatson@FreeBSD.org>	Assert tcbinfo lock in tcp_attach(), as it is required; the caller (tcp_usr_attach()) currently grabs it. MFC after: 7 days
# 2cdbfa66	20-May-2005	Paul Saab <ps@FreeBSD.org>	Replace t_force with a t_flag (TF_FORCEDATA). Submitted by: Raja Mukerji. Reviewed by: Mohan, Silby, Andre Opperman.
# b60d26c9	01-May-2005	Robert Watson <rwatson@FreeBSD.org>	Remove now unused inirw variable from previous use of COMMON_END(). Reported by: csjp
# 73fddeda	01-May-2005	Peter Grehan <grehan@FreeBSD.org>	Fix typo in last commit. Approved by: rwatson
# d1401c90	01-May-2005	Robert Watson <rwatson@FreeBSD.org>	Slide unlocking of the tcbinfo lock earlier in tcp_usr_send(), as it's needed only for implicit connect cases. Under load, especially on SMP, this can greatly reduce contention on the tcbinfo lock. NB: Ambiguities about the state of so_pcb need to be resolved so that all use of the tcbinfo lock in non-implicit connection cases can be eliminated. Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp>
# 812d8653	28-Mar-2005	Sam Leffler <sam@FreeBSD.org>	eliminate extraneous null ptr checks Noticed by: Coverity Prevent analysis tool
# d2bc35ab	14-Mar-2005	Robert Watson <rwatson@FreeBSD.org>	In tcp_usr_send(), broaden coverage of the socket buffer lock in the non-OOB case so that the sbspace() check is performed under the same lock instance as the append to the send socket buffer. MFC after: 1 week
# 0daccb9c	21-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	In the current world order, solisten() implements the state transition of a socket from a regular socket to a listening socket able to accept new connections. As part of this state transition, solisten() calls into the protocol to update protocol-layer state. There were several bugs in this implementation that could result in a race wherein a TCP SYN received in the interval between the protocol state transition and the shortly following socket layer transition would result in a panic in the TCP code, as the socket would be in the TCPS_LISTEN state, but the socket would not have the SO_ACCEPTCONN flag set. This change does the following: - Pushes the socket state transition from the socket layer solisten() to socket "library" routines called from the protocol. This permits the socket routines to be called while holding the protocol mutexes, preventing a race exposing the incomplete socket state transition to TCP after the TCP state transition has completed. The check for a socket layer state transition is performed by solisten_proto_check(), and the actual transition is performed by solisten_proto(). - Holds the socket lock for the duration of the socket state test and set, and over the protocol layer state transition, which is now possible as the socket lock is acquired by the protocol layer, rather than vice versa. This prevents additional state related races in the socket layer. This permits the dual transition of socket layer and protocol layer state to occur while holding locks for both layers, making the two changes atomic with respect to one another. Similar changes are likely required elsewhere in the socket/protocol code. Reported by: Peter Holm <peter@holm.cc> Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net> Philosophical head nod: gnn
# 9945c0e2	14-Feb-2005	Maxim Konovalov <maxim@FreeBSD.org>	o Add handling of an IPv4-mapped IPv6 address. o Use SYSCTL_IN() macro instead of direct call of copyin(9). Submitted by: ume o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where most of tcp sysctls live. o There are net.inet[6].tcp[6].getcred sysctls already, no needs in a separate struct tcp_ident_mapping. Suggested by: ume
# 212a79b0	06-Feb-2005	Maxim Konovalov <maxim@FreeBSD.org>	o Implement net.inet.tcp.drop sysctl and userland part, tcpdrop(8) utility: The tcpdrop command drops the TCP connection specified by the local address laddr, port lport and the foreign address faddr, port fport. Obtained from: OpenBSD Reviewed by: rwatson (locking), ru (man page), -current MFC after: 1 month
# c398230b	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for license, minor formatting changes
# c8443a1d	27-Nov-2004	Robert Watson <rwatson@FreeBSD.org>	Do export the advertised receive window via the tcpi_rcv_space field of struct tcp_info.
# b8af5dfa	26-Nov-2004	Robert Watson <rwatson@FreeBSD.org>	Implement parts of the TCP_INFO socket option as found in Linux 2.6. This socket option allows processes to query a TCP socket for some low level transmission details, such as the current send, bandwidth, and congestion windows. Linux provides a 'struct tcpinfo' structure containing various variables, rather than separate socket options; this makes the API somewhat fragile as it makes it difficult to add new entries of interest as requirements and implementation evolve. As such, I've included a large pad at the end of the structure. Right now, relatively few of the Linux API fields are filled in, and some contain no logical equivalent on FreeBSD. I've included __'d entries in the structure to make it easier to figure out what is and isn't omitted. This API/ABI should be considered unstable for the time being.
# 756d52a1	08-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Initialize struct pr_userreqs in new/sparse style and fill in common default elements in net_init_domain(). This makes it possible to grep these structures and see any bogosities.
# c94c54e4	02-Nov-2004	Andre Oppermann <andre@FreeBSD.org>	Remove RFC1644 T/TCP support from the TCP side of the network stack. A complete rationale and discussion is given in this message and the resulting discussion: http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706 Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplementation of the previous T/TCP functionality. Discussed on: -arch
# a4f757cd	16-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	White space cleanup for netinet before branch: - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>
# 1f44b0a1	14-Aug-2004	David Malone <dwmalone@FreeBSD.org>	Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD have already done this, so I have styled the patch on their work: 1) introduce a ip_newid() static inline function that checks the sysctl and then decides if it should return a sequential or random IP ID. 2) named the sysctl net.inet.ip.random_id 3) IPv6 flow IDs and fragment IDs are now always random. Flow IDs and frag IDs are significantly less common in the IPv6 world (ie. rarely generated per-packet), so there should be smaller performance concerns. The sysctl defaults to 0 (sequential IP IDs). Reviewed by: andre, silby, mlaier, ume Based on: NetBSD MFC after: 2 months
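The ip_newid() behaviour described in the entry above (check the sysctl, then return a sequential or a random IP ID) can be sketched roughly as follows. This is a userland illustration under stated assumptions, not the kernel function: the variable `ip_random_id` stands in for the net.inet.ip.random_id sysctl, `ip_newid_sketch()` is a hypothetical name, and rand() is used only as a placeholder for whatever random source the kernel uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for net.inet.ip.random_id; 0 = sequential IDs (the default). */
static int ip_random_id = 0;

/* Counter used for the sequential case; wraps at 16 bits. */
static uint16_t ip_id_counter = 0;

/* Hypothetical sketch of the selection logic described in the commit. */
static uint16_t
ip_newid_sketch(void)
{
	assert(ip_random_id == 0 || ip_random_id == 1);
	if (ip_random_id)
		return ((uint16_t)(rand() & 0xffff));	/* placeholder RNG */
	return (ip_id_counter++);
}
```

Keeping the decision inside one inline helper means callers do not need to know which scheme is active, which is what lets the compile-time RANDOM_IP_ID option become a runtime setting.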
# 0aa8ce50	26-Jul-2004	John-Mark Gurney <jmg@FreeBSD.org>	compare pointer against NULL, not 0 when inpcb is NULL, this is no longer invalid since jlemon added the tcp_twstart function... this prevents close "failing" w/ EINVAL when it really was successful... Reviewed by: jeremy (NetBSD)
# 8a59da30	16-Jul-2004	Hajimu UMEMOTO <ume@FreeBSD.org>	when IN6P_AUTOFLOWLABEL is set, the flowlabel is not set on outgoing tcp connections. Reported by: Orla McGann <orly@cnri.dit.ie> Reviewed by: Orla McGann <orly@cnri.dit.ie> Obtained from: KAME
# 3f9d1ef9	26-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Remove spl's from TCP protocol entry points. While not all locking is merged here yet, this will ease the merge process by bringing the locked and unlocked versions into sync.
# 4e397bc5	18-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	In tcp_ctloutput(), don't hold the inpcb lock over a call to ip_ctloutput(), as it may need to perform blocking memory allocations. This also improves consistency with locking relative to other points that call into ip_ctloutput(). Bumped into by: Grover Lines <grover@ceribus.net>
# c0b99ffa	14-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coalesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.
# f36cfd49	07-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
# 52710de1	04-Apr-2004	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Fix a panic possibility caused by returning without releasing locks. It was fixed by moving problematic checks, as well as checks that don't need locking, before locks are acquired. Submitted by: Ryan Sommers <ryans@gamersimpact.com> In co-operation with: cperciva, maxim, mlaier, sam Tested by: submitter (previous patch), me (current patch) Reviewed by: cperciva, mlaier (previous patch), sam (current patch) Approved by: sam Dedicated to: enough!
# 56dc72c3	28-Mar-2004	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Remove unused argument.
# b0330ed9	27-Mar-2004	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Reduce 'td' argument to 'cred' (struct ucred) argument in those functions: - in_pcbbind(), - in_pcbbind_setup(), - in_pcbconnect(), - in_pcbconnect_setup(), - in6_pcbbind(), - in6_pcbconnect(), - in6_pcbsetport(). "It should simplify/clarify things a great deal." --rwatson Requested by: rwatson Reviewed by: rwatson, ume
# 6823b823	27-Mar-2004	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Remove unused argument. Reviewed by: ume
# 88f6b043	16-Feb-2004	Bruce M Simpson <bms@FreeBSD.org>	Shorten the name of the socket option used to enable TCP-MD5 packet treatment. Submitted by: Vincent Jardin
# 265ed012	13-Feb-2004	Bruce M Simpson <bms@FreeBSD.org>	Brucification. Submitted by: bde
# 1cfd4b53	10-Feb-2004	Bruce M Simpson <bms@FreeBSD.org>	Initial import of RFC 2385 (TCP-MD5) digest support. This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net
# e29ef13f	10-Jan-2004	Don Lewis <truckman@FreeBSD.org>	Check that sa_len is the appropriate value in tcp_usr_bind(), tcp6_usr_bind(), tcp_usr_connect(), and tcp6_usr_connect() before checking to see whether the address is multicast so that the proper errno value will be returned if sa_len is incorrect. The checks are identical to the ones in in_pcbbind_setup(), in6_pcbbind(), and in6_pcbladdr(), which are called after the multicast address check passes. MFC after: 30 days
# 53369ac9	08-Jan-2004	Andre Oppermann <andre@FreeBSD.org>	Limiters and sanity checks for TCP MSS (maximum segment size) resource exhaustion attacks. For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU. The resource exhaustion works in two ways: o during tcp connection setup the advertised local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one. For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limits of network components along the path. This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP). We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets. o the local host is receiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may choose to do what is described in the first attack and send the data in packets with a TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process. For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches). This type of attack is particularly effective for servers where the attacker can upload large amounts of data.
Normally this is the case with WWW servers where large POSTs can be made. We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is reset and dropped. MITRE CVE: CAN-2004-0002 Reviewed by: sam (mentor) MFC after: 1 day
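The arithmetic and the overload rule in the entry above can be sketched in a few lines of C. This is an illustration under stated assumptions, not the kernel code: `segments_needed()` and `minmss_overload()` are hypothetical names, and the one-second accounting window is simplified to two counters passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Segments needed to carry `bytes` of data at `payload` bytes per segment. */
static uint64_t
segments_needed(uint64_t bytes, uint32_t payload)
{
	assert(payload > 0);
	return ((bytes + payload - 1) / payload);
}

/*
 * Hypothetical predicate for the second mitigation: flag a connection
 * when the packet rate over the last second exceeds `minmssoverload`
 * and the average payload per packet is below `minmss`.
 */
static bool
minmss_overload(uint64_t bytes_last_sec, uint32_t pkts_last_sec,
    uint32_t minmss, uint32_t minmssoverload)
{
	if (pkts_last_sec <= minmssoverload)
		return (false);		/* rate too low to matter */
	return (bytes_last_sec / pkts_last_sec < minmss);
}
```

For 17376 bytes of data, a 1448-byte payload needs 12 segments while a 12-byte payload needs 1448, which is the roughly 120-fold increase the commit message mentions. With the stated defaults (minmss 256, minmssoverload 1000), a peer sending 4716 one-byte packets per second would be flagged, while 2000 full-sized packets per second would not.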
# 5bd311a5	25-Nov-2003	Sam Leffler <sam@FreeBSD.org>	Split the "inp" mutex class into separate classes for each of divert, raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness complaints. Reviewed by: rwatson Approved by: re (rwatson)
# 97d8d152	20-Nov-2003	Andre Oppermann <andre@FreeBSD.org>	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
# a557af22	17-Nov-2003	Robert Watson <rwatson@FreeBSD.org>	Introduce a MAC label reference in 'struct inpcb', which caches the MAC label referenced from 'struct socket' in the IPv4 and IPv6-based protocols. This permits MAC labels to be checked during network delivery operations without dereferencing inp->inp_socket to get to so->so_label, which will eventually avoid our having to grab the socket lock during delivery at the network layer. This change introduces 'struct inpcb' as a labeled object to the MAC Framework, along with the normal circus of entry points: initialization, creation from socket, destruction, as well as a delivery access control check. For most policies, the inpcb label will simply be a cache of the socket label, so a new protocol switch method is introduced, pr_sosetlabel() to notify protocols that the socket layer label has been updated so that the cache can be updated while holding appropriate locks. Most protocols implement this using pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use the the worker function in_pcbsosetlabel(), which calls into the MAC Framework to perform a cache update. Biba, LOMAC, and MLS implement these entry points, as do the stub policy, and test policy. Reviewed by: sam, bms Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 395bb186	27-Oct-2003	Sam Leffler <sam@FreeBSD.org>	speedup stream socket recv handling by tracking the tail of the mbuf chain instead of walking the list for each append Submitted by: ps/jayanth Obtained from: netbsd (jason thorpe)
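The idea in the entry above, keeping a pointer to the tail of a chain so that an append does not walk the whole list, can be shown with a plain singly linked list. This is a generic sketch, not the sockbuf/mbuf code; `struct node`, `struct chain`, and `chain_append()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* A minimal chain element; `len` stands in for the buffered data. */
struct node {
	struct node	*next;
	int		 len;
};

/* The chain remembers both ends so that appends are O(1). */
struct chain {
	struct node	*head;
	struct node	*tail;
};

/* Append `n` at the end without traversing the list. */
static void
chain_append(struct chain *c, struct node *n)
{
	assert(c != NULL && n != NULL);
	n->next = NULL;
	if (c->tail == NULL)
		c->head = n;		/* first element */
	else
		c->tail->next = n;	/* link after the old tail */
	c->tail = n;
}
```

The cost is one extra pointer per chain and the obligation to update it wherever elements are removed, in exchange for constant-time appends on the receive path.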
# a3b6edc3	08-Mar-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Remove check for t_state == TCPS_TIME_WAIT and introduce the tw structure. Sponsored by: DARPA, NAI Labs
# edf02ff1	24-Feb-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Hold the TCP protocol lock while modifying the connection hash table.
# efac726e	23-Oct-2002	Ian Dowse <iedowse@FreeBSD.org>	Unbreak the automatic remapping of an INADDR_ANY destination address to the primary local IP address when doing a TCP connect(). The tcp_connect() code was relying on in_pcbconnect (actually in_pcbladdr) modifying the passed-in sockaddr, and I failed to notice this in the recent change that added in_pcbconnect_setup(). As a result, tcp_connect() was ending up using the unmodified sockaddr address instead of the munged version. There are two cases to handle: if in_pcbconnect_setup() succeeds, then the PCB has already been updated with the correct destination address as we pass it pointers to inp_faddr and inp_fport directly. If in_pcbconnect_setup() fails due to an existing but dead connection, then copy the destination address from the old connection.
# 5200e00e	21-Oct-2002	Ian Dowse <iedowse@FreeBSD.org>	Replace in_pcbladdr() with a more generic inner subroutine for in_pcbconnect() called in_pcbconnect_setup(). This version performs all of the functions of in_pcbconnect() except for the final committing of changes to the PCB. In the case of an EADDRINUSE error it can also provide to the caller the PCB of the duplicate connection, avoiding an extra in_pcblookup_hash() lookup in tcp_connect(). This change will allow the "temporary connect" hack in udp_output() to be removed and is part of the preparation for adding the IP_SENDSRCADDR control message. Discussed on: -net Approved by: re
# 4a6a94d8	22-Aug-2002	Archie Cobbs <archie@FreeBSD.org>	Replace (ab)uses of "NULL" where "0" is really meant.
# 26ef6ac4	21-Aug-2002	Don Lewis <truckman@FreeBSD.org>	Create new functions in_sockaddr(), in6_sockaddr(), and in6_v4mapsin6_sockaddr() which allocate the appropriate sockaddr_in* structure and initialize it with the address and port information passed as arguments. Use calls to these new functions to replace code that is replicated multiple times in in_setsockaddr(), in_setpeeraddr(), in6_setsockaddr(), in6_setpeeraddr(), in6_mapped_sockaddr(), and in6_mapped_peeraddr(). Inline COMMON_END in tcp_usr_accept() so that we can call in_sockaddr() with temporary copies of the address and port after the PCB is unlocked. Fix the lock violation in tcp6_usr_accept() (caused by calling MALLOC() inside in6_mapped_peeraddr() while the PCB is locked) by changing the implementation of tcp6_usr_accept() to match tcp_usr_accept(). Reviewed by: suz
# 1fcc99b5	17-Aug-2002	Matthew Dillon <dillon@FreeBSD.org>	Implement TCP bandwidth delay product window limiting, similar to (but not meant to duplicate) TCP/Vegas. Add four sysctls and default the implementation to 'off'. net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off) net.inet.tcp.inflight_debug debugging (defaults to 1=on) net.inet.tcp.inflight_min minimum window limit net.inet.tcp.inflight_max maximum window limit MFC after: 1 week
# d46a5312	29-Jul-2002	Maxim Konovalov <maxim@FreeBSD.org>	Use a common way to release locks before exit. Reviewed by: hsu
# 66ef17c4	25-Jul-2002	Hajimu UMEMOTO <ume@FreeBSD.org>	make setsockopt(IPV6_V6ONLY, 0) actually work for tcp6. MFC after: 1 week
# eccb7001	25-Jul-2002	Hajimu UMEMOTO <ume@FreeBSD.org>	cleanup usage of ip6_mapped_addr_on and ip6_v6only. now, ip6_mapped_addr_on is unified into ip6_v6only. MFC after: 1 week
# 9c68f33a	13-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Because we're holding an exclusive write lock on the head, references to the new inp cannot leak out even though it has been placed on the head list.
# f76fcf6d	10-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Lock up inpcb. Submitted by: Jennifer Yang <yangjihui@yahoo.com>
# 4cc20ab1	31-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Back out my last commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
# 243917fe	19-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# c1cd65ba	24-Mar-2002	Bruce Evans <bde@FreeBSD.org>	Fixed some style bugs in the removal of __P(()). Continuation lines were not outdented to preserve non-KNF lining up of code with parentheses. Switch to KNF formatting.
# 4d77a549	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# b7d6d952	28-Feb-2002	Hajimu UMEMOTO <ume@FreeBSD.org>	- Set inc_isipv6 in tcp6_usr_connect(). - When making a pcb from a sync cache, do not forget to copy inc_isipv6. Obtained from: KAME MFC After: 1 week
# a854ed98	27-Feb-2002	John Baldwin <jhb@FreeBSD.org>	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
# be2ac88c	21-Nov-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Introduce a syncache, which enables FreeBSD to withstand a SYN flood DoS in an improved fashion over the existing code. Reviewed by: silby (in a previous iteration) Sponsored by: DARPA, NAI Labs
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doozy!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# b0e3ad75	21-Aug-2001	Mike Silbersack <silby@FreeBSD.org>	Much delayed but now present: RFC 1948 style sequence numbers In order to ensure security and functionality, RFC 1948 style initial sequence number generation has been implemented. Barring any major cryptographic breakthroughs, this algorithm should be unbreakable. In addition, the problems with TIME_WAIT recycling which affect our currently used algorithm are not present. Reviewed by: jesper
# 13cf67f3	26-Jul-2001	Hajimu UMEMOTO <ume@FreeBSD.org>	move ipsec security policy allocation into in_pcballoc, before making pcbs available to the outside world. otherwise, we will see inpcb without ipsec security policy attached (-> panic() in ipsec.c). Obtained from: KAME MFC after: 3 days
# 81e561cd	13-Jul-2001	David E. O'Brien <obrien@FreeBSD.org>	Bump net.inet.tcp.sendspace to 32k and net.inet.tcp.recvspace to 65k. This should help us in naive benchmark "tests". It seems a wide number of people think 32k buffers would not cause major issues, and is in fact in use by many other OS's at this time. The receive buffers can be bumped higher as buffers are hardly used and several research papers indicate that receive buffers rarely use much space at all. Submitted by: Leo Bicknell <bicknell@ufp.org> <20010713101107.B9559@ussenterprise.ufp.org> Agreed to in principle by: dillon (at the 32k level)
# 2d610a50	07-Jul-2001	Mike Silbersack <silby@FreeBSD.org>	Temporary feature: Runtime tuneable tcp initial sequence number generation scheme. Users may now select between the currently used OpenBSD algorithm and the older random positive increment method. While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT handling; this is causing trouble for an increasing number of folks. To switch between generation schemes, one sets the sysctl net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments, 1 = the OpenBSD algorithm. 1 is still the default. Once a secure _and_ compatible algorithm is implemented, this sysctl will be removed. Reviewed by: jlemon Tested by: numerous subscribers of -net
# 08517d53	22-Jun-2001	Mike Silbersack <silby@FreeBSD.org>	Eliminate the allocation of a tcp template structure for each connection. The information contained in a tcptemp can be reconstructed from a tcpcb when needed. Previously, tcp templates required the allocation of one mbuf per connection. On large systems, this change should free up a large number of mbufs. Reviewed by: bmilekic, jlemon, ru MFC after: 2 weeks
# 33841545	10-Jun-2001	Hajimu UMEMOTO <ume@FreeBSD.org>	Sync with recent KAME. This work was based on kame-20010528-freebsd43-snap.tgz, and some critical problems found after the snap was out were fixed. There are many many changes since last KAME merge. TODO: - The definitions of SADB_* in sys/net/pfkeyv2.h are still different from RFC2407/IANA assignment because of binary compatibility issue. It should be fixed under 5-CURRENT. - ip6po_m member of struct ip6_pktopts is no longer used. But, it is still there because of binary compatibility issue. It should be removed under 5-CURRENT. Reviewed by: itojun Obtained from: KAME MFC after: 3 weeks
# d1745f45	20-Apr-2001	Jesper Skriver <jesper@FreeBSD.org>	Say goodbye to TCP_COMPAT_42 Reviewed by: wollman Requested by: wollman
# f0a04f3f	17-Apr-2001	Kris Kennaway <kris@FreeBSD.org>	Randomize the TCP initial sequence numbers more thoroughly. Obtained from: OpenBSD Reviewed by: jesper, peter, -developers
# 1db24ffb	11-Mar-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Unbreak LINT. Pointed out by: phk
# c0647e0d	09-Mar-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Push the test for a disconnected socket when accept()ing down to the protocol layer. Not all protocols behave identically. This fixes the brokenness observed with unix-domain sockets (and postfix)
# 91421ba2	20-Feb-2001	Robert Watson <rwatson@FreeBSD.org>	o Move per-process jail pointer (p->pr_prison) to inside of the subject credential structure, ucred (cr->cr_prison). o Allow jail inheritance to be a function of credential inheritance. o Abstract prison structure reference counting behind pr_hold() and pr_free(), invoked by the similarly named credential reference management functions, removing this code from per-ABI fork/exit code. o Modify various jail() functions to use struct ucred arguments instead of struct proc arguments. o Introduce jailed() function to determine if a credential is jailed, rather than directly checking pointers all over the place. o Convert PRISON_CHECK() macro to prison_check() function. o Move jail() function prototypes to jail.h. o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the flag in the process flags field itself. o Eliminate the "const" qualifier from suser/p_can/etc to reflect mutex use. Notes: o Some further cleanup of the linux/jail code is still required. o It's now possible to consider resolving some of the process vs credential based permission checking confusion in the socket code. o Mutex protection of struct prison is still not present, and is required to protect the reference count plus some fields in the structure. Reviewed by: freebsd-arch Obtained from: TrustedBSD Project
# 007581c0	02-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	When turning off TCP_NOPUSH, call tcp_output to immediately flush out any data pending in the buffer. Submitted by: Tony Finch <dot@dotat.at>
# fdaf052e	01-Apr-2000	Yoshinobu Inoue <shin@FreeBSD.org>	Support per socket based IPv4 mapped IPv6 addr enable/disable control. Submitted by: ume
# fb59c426	09-Jan-2000	Yoshinobu Inoue <shin@FreeBSD.org>	tcp updates to support IPv6. also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 6a800098	22-Dec-1999	Yoshinobu Inoue <shin@FreeBSD.org>	IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 79ea3cf1	12-Dec-1999	Yoshinobu Inoue <shin@FreeBSD.org>	Always set INP_IPV4 flag for IPv4 pcb entries, because netstat needs it to print out protocol specific pcb info. A patch submitted by guido@gvr.org, and asmodai@wxs.nl also reported the problem. Thanks and sorry for your troubles. Submitted by: guido@gvr.org Reviewed by: shin
# cfa1ca9d	07-Dec-1999	Yoshinobu Inoue <shin@FreeBSD.org>	udp IPv6 support, IPv6/IPv4 tunneling support in kernel, packet divert at kernel for IPv6/IPv4 translator daemon. This includes a queue related patch submitted by jburkhol@home.com. Submitted by: queue related patch from jburkhol@home.com Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 45d3a132	18-Nov-1999	Peter Wemm <peter@FreeBSD.org>	Fix a warning and a potential panic if TCPDEBUG is active. (tp is a wild pointer and used by TCPDEBUG2())
# 9b8b58e0	30-Aug-1999	Jonathan Lemon <jlemon@FreeBSD.org>	Restructure TCP timeout handling: - eliminate the fast/slow timeout lists for TCP and instead use a callout entry for each timer. - increase the TCP timer granularity to HZ - implement "bad retransmit" recovery, as presented in "On Estimating End-to-End Network Path Properties", by Allman and Paxson. Submitted by: jlemon, wollman
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# 9c9906e9	03-Jun-1999	Peter Wemm <peter@FreeBSD.org>	Plug a mbuf leak in tcp_usr_send(). pru_send() routines are expected to either enqueue or free their mbuf chains, but tcp_usr_send() was dropping them on the floor if the tcpcb/inpcb has been torn down in the middle of a send/write attempt. This has been responsible for a wide variety of mbuf leak patterns, ranging from slow gradual leakage to rather rapid exhaustion. This has been a problem since before 2.2 was branched and appears to have been fixed in rev 1.16 and lost in 1.23/1.28. Thanks to Jayanth Vijayaraghavan <jayanth@yahoo-inc.com> for checking (extensively) into this on a live production 2.2.x system and that it was the actual cause of the leak and looks like it fixes it. The machine in question was losing (from memory) about 150 mbufs per hour under load and a change similar to this stopped it. (Don't blame Jayanth for this patch though) An alternative approach to this would be to recheck SS_CANTSENDMORE etc inside the splnet() right before calling pru_send() after all the potential sleeps, interrupts and delays have happened. However, this would mean exposing knowledge of the tcp stack's reset handling and removal of the pcb to the generic code. There are other things that call pru_send() directly though. Problem originally noted by: John Plevyak <jplevyak@inktomi.com>
# 3d177f46	03-May-1999	Bill Fumerola <billf@FreeBSD.org>	Add sysctl descriptions to many SYSCTL_XXXs PR: kern/11197 Submitted by: Adrian Chadd <adrian@FreeBSD.org> Reviewed by: billf(spelling/style/minor nits) Looked at by: bde(style)
# 75c13541	28-Apr-1999	Poul-Henning Kamp <phk@FreeBSD.org>	This Implements the mumbled about "Jail" feature. This is a seriously beefed up chroot kind of thing. The process is jailed along the same lines as a chroot does it, but with additional tough restrictions imposed on what the superuser can do. For all I know, it is safe to hand over the root bit inside a prison to the customer living in that prison, this is what it was developed for in fact: "real virtual servers". Each prison has an ip number associated with it, which all IP communications will be coerced to use and each prison has its own hostname. Needless to say, you need more RAM this way, but the advantage is that each customer can run their own particular version of apache and not stomp on the toes of their neighbors. It generally does what one would expect, but setting up a jail still takes a little knowledge. A few notes: I have no scripts for setting up a jail, don't ask me for them. The IP number should be an alias on one of the interfaces. mount a /proc in each jail, it will make ps more useable. /proc/<pid>/status tells the hostname of the prison for jailed processes. Quotas are only sensible if you have a mountpoint per prison. There are no provisions for stopping resource-hogging. Some "#ifdef INET" and similar may be missing (send patches!) If somebody wants to take it from here and develop it into more of a "virtual machine" they should be most welcome! Tools, comments, patches & documentation most welcome. Have fun... Sponsored by: http://www.rndassociates.com/ Run for almost a year by: http://www.servetheweb.com/
# 3879597f	24-Apr-1999	Andrey A. Chernov <ache@FreeBSD.org>	so_linger is in seconds, not in 1/HZ PR: 11252 Submitted by: Martin Kammerhofer <dada@sbox.tu-graz.ac.at>
# b0acefa8	20-Jan-1999	Bill Fenner <fenner@FreeBSD.org>	Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This flag means that there is more data to be put into the socket buffer. Use it in TCP to reduce the interaction between mbuf sizes and the Nagle algorithm. Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's fix for this problem.
# f1d19042	07-Dec-1998	Archie Cobbs <archie@FreeBSD.org>	The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static and local variables, goto labels, and functions declared but not defined.
# cfe8b629	22-Aug-1998	Garrett Wollman <wollman@FreeBSD.org>	Yow! Completely change the way socket options are handled, eliminating another specialized mbuf type in the process. Also clean up some of the cruft surrounding IPFW, multicast routing, RSVP, and other ill-explored corners.
# c3229e05	27-Jan-1998	David Greenman <dg@FreeBSD.org>	Improved connection establishment performance by doing local port lookups via a hashed port list. In the new scheme, in_pcblookup() goes away and is replaced by a new routine, in_pcblookup_local() for doing the local port check. Note that this implementation is space inefficient in that the PCB struct is now too large to fit into 128 bytes. I might deal with this in the future by using the new zone allocator, but I wanted these changes to be extensively tested in their current form first. Also: 1) Fixed off-by-one errors in the port lookup loops in in_pcbbind(). 2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash() to do the initial hash insertion. 3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability. 4) Added a new routine, in_pcbremlists() to remove the PCB from the various hash lists. 5) Added/deleted comments where appropriate. 6) Removed unnecessary splnet() locking. In general, the PCB functions should be called at splnet()...there are unfortunately a few exceptions, however. 7) Reorganized a few structs for better cache line behavior. 8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in the future, however. These changes have been tested on wcarchive for more than a month. In tests done here, connection establishment overhead is reduced by more than 50 times, thus getting rid of one of the major networking scalability problems. Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult. WARNING: Anything that knows about inpcb and tcpcb structs will have to be recompiled; at the very least, this includes netstat(1).
# 744f87ea	18-Dec-1997	David Greenman <dg@FreeBSD.org>	Fixed a missing splx(s) bug in tcp_usr_send().
# 0cc12cc5	16-Sep-1997	Joerg Wunsch <joerg@FreeBSD.org>	Make TCPDEBUG a new-style option.
# f8f6cbba	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	Update network code to use poll support.
# 57bf258e	16-Aug-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to not malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# 1fd0b058	02-Aug-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# a29f300e	27-Apr-1997	Garrett Wollman <wollman@FreeBSD.org>	The long-awaited mega-massive-network-code-cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so that they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
# ef53690b	21-Feb-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix potential crash where a user attempts to perform an implied connect in TCP while sending urgent data. It is not clear what purpose is served by doing this, but there's no good reason why it shouldn't work. Submitted by: tjevans@raleigh.ibm.com via wpaul
# 117bcae7	18-Feb-1997	Garrett Wollman <wollman@FreeBSD.org>	Convert raw IP from mondo-switch-statement-from-Hell to pr_usrreqs. Collapse duplicates with udp_usrreq.c and tcp_usrreq.c (calling the generic routines in uipc_socket2.c and in_pcb.c). Calling sockaddr() or peeraddr() on a detached socket now traps, rather than harmlessly returning an error; this should never happen. Allow the raw IP buffer sizes to be controlled via sysctl.
# d0390e05	14-Feb-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix the mechanism for choosing whether to save the slow-start threshold in the route. This allows us to remove the unconditional setting of the pipesize in the route, which should mean that SO_SNDBUF and SO_RCVBUF should actually work again. While we're at it: - Convert udp_usrreq from `mondo switch statement from Hell' to new-style. - Delete old TCP mondo switch statement from Hell, which had previously been diked out.
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 6d6a026b	07-Oct-1996	David Greenman <dg@FreeBSD.org>	Improved in_pcblookuphash() to support wildcarding, and changed relevant callers of it to take advantage of this. This reduces new connection request overhead in the face of a large number of PCBs in the system. Thanks to David Filo <filo@yahoo.com> for suggesting this and providing a sample implementation (which wasn't used, but showed that it could be done). Reviewed by: wollman
# 7b40aa32	13-Sep-1996	Paul Traina <pst@FreeBSD.org>	Make the misnamed tcp initial keepalive timer value (which is really the time, in seconds, that state for non-established TCP sessions stays about) a sysctl modifiable variable. [part 1 of two commits, I just realized I can't play with the indices as I was typing this commit message.]
# af7a2999	12-Jul-1996	David Greenman <dg@FreeBSD.org>	Fixed two bugs in previous commit: be sure to include tcp_debug.h when TCPDEBUG is defined, and fix typo in TCPDEBUG2() macro.
# 2c37256e	11-Jul-1996	Garrett Wollman <wollman@FreeBSD.org>	Modify the kernel to use the new pr_usrreqs interface rather than the old pr_usrreq mechanism which was poorly designed and error-prone. This commit renames pr_usrreq to pr_ousrreq so that old code which depended on it would break in an obvious manner. This commit also implements the new interface for TCP, although the old function is left as an example (#ifdef'ed out). This commit ALSO fixes a longstanding bug in the TCP timer processing (introduced by davidg on 1995/04/12) which caused timer processing on a TCB to always stop after a single timer had expired (because it misinterpreted the return value from tcp_usrreq() to indicate that the TCB had been deleted). Finally, some code related to polling has been deleted from if.c because it is not relevant to -current and doesn't look at all like my current code.
# 2ee45d7d	11-Mar-1996	David Greenman <dg@FreeBSD.org>	Move or add #include <queue.h> in preparation for upcoming struct socket changes.
# 2baeef32	06-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Removed unnecessary #includes of vm stuff. Most of them were once prerequisites for <sys/sysctl.h>. subr_prof.c: Also replaced #include of <sys/user.h> by #include of <sys/resourcevar.h>.
# 0312fbe9	14-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	New style sysctl & staticize a lot of stuff.
# 98163b98	09-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Start adding new style sysctl here too.
# a45d2726	03-Nov-1995	Andras Olah <olah@FreeBSD.org>	Fix a logical error in T/TCP: when we actively open a connection, we have to decide whether to send a CC or CCnew option in our SYN segment depending on the contents of our TAO cache. This decision has to be made once when the connection starts. The earlier code delayed this decision until the segment was assembled in tcp_output() and retransmitted SYN segments could have different CC options. Reviewed by: Richard Stevens, davidg, wollman
# b6239c4a	29-Oct-1995	Andras Olah <olah@FreeBSD.org>	Start the 2MSL timer when the socket is closed and the TCP connection is in the FIN_WAIT_2 state in order to prevent the conn. hanging there forever. Reviewed by: davidg, olah Submitted by: Arne Henrik Juul <arnej@imf.unit.no> Obtained from: bugs@netbsd.org
# efe4b0eb	21-Sep-1995	Garrett Wollman <wollman@FreeBSD.org>	Second try: get 4.4-Lite-2 into the source tree. The conflicts don't matter because none of our working source files are on the CSRG branch any more. Obtained from: 4.4BSD-Lite-2
# b6e3d50f	13-Sep-1995	Garrett Wollman <wollman@FreeBSD.org>	Don't leak mbufs in an unusual error case in tcp_usrreq(). Reviewed by: Andras Olah <olah@freebsd.org> Obtained from: Lite-2
# d3628763	11-Jun-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Merge RELENG_2_0_5 into HEAD
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# 15bd2b43	08-Apr-1995	David Greenman <dg@FreeBSD.org>	Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash, and in_pcblookuphash.
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# c7a82f90	16-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Include missing <sys/kernel.h> for `hz'. Submitted by: David Greenman, Rod Grimes, Christoph Kukulies
# 1fdbc7ae	15-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Correctly initialize so_linger in ticks (not seconds). Obtained from: Stevens, vol. 2, p. 1010
# 41f82abe	15-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Transaction TCP support now standard. Hack away!
# f2ea20e6	15-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Add lots of useful MIB variables and a few not-so-useful ones for completeness.
# a0292f23	09-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and Bob Braden <braden@isi.edu>. NB: This has not had David's TCP ACK hack re-integrated. It is not clear what the correct solution to this problem is, if any. If a better solution doesn't pop up in response to this message, I'll put David's code back in (or he's welcome to do so himself).
# 9ee39fc6	15-Dec-1994	Garrett Wollman <wollman@FreeBSD.org>	Fix PR 59: don't allow TCP connections with multicast addresses at either end.
# 610ee2f9	15-Sep-1994	David Greenman <dg@FreeBSD.org>	Made TCPDEBUG truly optional. Based on changes I made in FreeBSD 1.1.5. Fixed somebody's idea of a joke - about the first half of the lines in in_proto.c were spaced over by one space.
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 26e30fbb	29-May-1994	David Greenman <dg@FreeBSD.org>	Increased tcp_send/recvspace to 16k, and added TCP_SMALLSPACE ifdef to set it to 4k.
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources