History log of /freebsd-current/sys/netinet/tcp_input.c
Revision	Date	Author	Comments
# df9de82f	25-May-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp: fix sending RST after second inp lookup When we first find an inp, we also set the tp. If a second lookup is then necessary, the inp is recomputed. If this fails, the tp is not cleared, which resulted in a failing KASSERT. Therefore, clear the tp when starting the inp lookup procedure. Reported by: Jenkins Fixes: 02d15215cef2 ("tcp: improve blackhole support") MFC after: 1 week Sponsored by: Netflix, Inc.
# 02d15215	23-May-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp: improve blackhole support There are two improvements to the TCP blackhole support: (1) If net.inet.tcp.blackhole is set to 2, also send no RST whenever a segment is received on an existing closed socket or if there is a port mismatch when using UDP encapsulation. (2) If net.inet.tcp.blackhole is set to 3, no RST segment is sent in response to incoming segments on closed sockets or in response to unexpected segments on listening sockets. Thanks to gallatin@ for suggesting such an improvement. Reviewed by: gallatin MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D45304
# fce03f85	05-May-2024	Randall Stewart <rrs@FreeBSD.org>	TCP can be subject to Sack Attacks, let's fix this issue. There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like: ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704) This first sack looks fine but then the attacker sends ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705) ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707) ... These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. As your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands of linked list entries. To combat this we introduce three things. (1) When the peer requests a very small MSS we stop processing SACKs from them. This prevents a malicious peer from just using a small MSS to do the same thing. (2) Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks. (3) The sack filter used to throw away data when its bounds were exceeded. Instead, we now increase its size to 15 and throw away SACKs if the filter gets over-run, to prevent a malicious attacker from over-running the sack filter and thereby making us process things anyway. The default stack will need to start using the sack-filter, which we have talked about in past conference calls, to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks). After this set of changes is in, rack can drop its SAD detection completely. Reviewed by: tuexen@, rscheff@ Differential Revision: <https://reviews.freebsd.org/D44903>
# c9cd686b	18-Apr-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp: drop data received after a FIN has been processed RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING, LAST-ACK, and TIME-WAIT states: This should not occur since a FIN has been received from the remote side. Ignore the segment text. Therefore, implement this handling. Reviewed by: rrs, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44746
# e8c149ab	07-Apr-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp: add some debug output Also log, when dropping text or FIN after having received a FIN. This is the intended behavior described in RFC 9293. A follow-up patch will enforce this behavior for the base stack and the RACK stack. Reviewed by: rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44669
# 3e1c8a35	06-Apr-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp: improve consistency No functional change intended. Reported by: Coverity Scan CID: 1523781 Reviewed by: rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44645
# dd7b86e2	18-Mar-2024	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove IS_FASTOPEN() macro The macro is more obfuscating than helping as it just checks a single flag of t_flags. All other t_flags bits are checked without a macro. A bigger problem was that declaration of the macro in tcp_var.h depended on a kernel option. It is a bad practice to create such definitions in installable headers. Reviewed by: rscheff, tuexen, kib Differential Revision: https://reviews.freebsd.org/D44362
# 40fdc6d2	24-Feb-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: provide correct snd_fack on post_recovery Ensure that snd_fack holds a valid value when doing the post_recovery CC processing, for preparation of the cc_cubic update, so that local pipe calculations can correctly refer to snd_fack during and after CC events. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43957
# fcea1cc9	14-Feb-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: fix RTO ssthresh for non-6675 pipe calculation Follow up on D43768 to properly deal with the non-default pipe calculation. When CC_RTO is processed, the timeout will have already pulled back snd_nxt. Further, snd_fack is not pulled along with snd_una. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43876
# 3eeb22cb	10-Feb-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: clean scoreboard when releasing the socket buffer The SACK scoreboard is conceptually an extension of the socket buffer. Remove it when the socket buffer goes away with soisdisconnected(). Verify that this is also the expected state in tcp_discardcb(). PR: 276761 Reviewed by: glebius, tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43805
# 0b3f9e43	27-Jan-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: move cc_post_recovery past snd_una update The RFC6675 pipe calculation (sack.revised, enabled by default since D28702), uses outdated information, while the previous default calculated it correctly with up-to-date information from the incoming ACK. This difference can become as large as the receive window (not the congestion window previously), potentially triggering a massive burst of new packets. MFC after: 1 week Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43520
# 2d05a1c8	25-Jan-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: commonize check for more data to send, style changes Use SEQ_SUB instead of a plain subtraction, for an implicit type conversion and prevention of a possible overflow. Use curly brackets in stacked if statements throughout. Use the ? operator to enhance readability when clearing the FIN flag in tcp_output(). None of the above changes the function. Reviewed By: tuexen, cc, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43539
# c7c325d0	24-Jan-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: pass maxseg around instead of calculating locally Improve slowpath processing (reordering, retransmissions) slightly by calculating maxseg only once. This typically saves one of two calls to tcp_maxseg(). Reviewed By: glebius, tuexen, cc, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43536
# 429f14f8	08-Jan-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: clean PRR state after ECN congestion recovery. PRR state was not properly reset on subsequent ECN CE events. Clean up after local transmission failures too. Reviewed by: tuexen, cc, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43170
# f4574e2d	08-Jan-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: prevent spurious empty segments and fix uncommon panic Only try sending more data on pure ACKs when there is more data available in the send buffer. In the case of a retransmitted SYN not being sent due to an internal error, the snd_una/snd_nxt accounting could be off, leading to a panic. Pulling snd_nxt up to snd_una prevents this from happening. Reported by: fengdreamer@126.com Reviewed by: cc, tuexen, #transport MFC after: 1 week Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43343
# 30409ecd	06-Jan-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: do not purge SACK scoreboard on first RTO Keeping the SACK scoreboard intact after the first RTO and retransmitting all data anew only on subsequent RTOs allows a more timely and efficient loss recovery under many adverse circumstances. Reviewed By: tuexen, #transport MFC after: 10 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D42906
# 893ed42e	06-Jan-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Make use of enum for sack_changed No functional change. Reviewed By: tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43346
# 513f2e2e	19-Dec-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: always set tcp_tun_port to a correct value The tcp_tun_port field that is used to pass port value between UDP and TCP in case of tunneling is a generic field that is used to pass data between network layers. It can be contaminated on entry, e.g. by a VLAN tag set by a NIC driver. Explicitly set it, so that it is zeroed out in a normal not-tunneled TCP. If it contains garbage, tcp_twcheck() later can enter the wrong block of code and treat the packet as an incorrectly tunneled one. On main and stable/14 that will end up with sending incorrect responses, but on stable/13 with ipfw(8) and pcb-matching rules it may end up in a panic. This is a minimal conservative patch to be merged to stable branches. Later we may redesign this. PR: 275169 Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43065
# 9276ad23	07-Dec-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: shift PRR sending cadence slightly left Don't let PRR pass up on the opportunity of clocking out packets on arrival of ACKs - by pulling sends forward by about half a packet. Prevents unexpectedly long runs of incoming ACKs without eliciting a packet transmission. MFC after: 1 week Reviewed By: #transport, tuexen Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D42918
# f42518ff	30-Nov-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: for LRD move sysctl from tcp.do_lrd to tcp.sack.lrd, remove sockopt Moving lrd sysctl to the tcp.sack branch, since LRD only works with SACK. Remove the sockopt to programmatically control LRD per session. Reviewed By: #transport, tuexen, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D42851
# 34c45bc6	29-Nov-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: enable LRD by default Lost Retransmission Detection was added as a feature in May 2021, but disabled by default. Enabling the feature by default to reduce the flow completion time by avoiding RTOs when retransmissions get lost too. Reviewed By: tuexen, #transport, zlei MFC after: 10 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D42845
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 49a6fbe3	15-Nov-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	[tcp] add PRR 6937bis heuristic and retire prr_conservative sysctl Improve Proportional Rate Reduction (RFC6937) by using a heuristic, which automatically chooses between conservative CRB and more aggressive SSRB modes. Only when snd_una advances (a partial ACK), SSRB may be used. Also, that ACK must not have any indication of ongoing loss - using the addition of new holes into the scoreboard as proxy for such an event. MFC after: 4 weeks Reviewed By: #transport, kbowling, rrs Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28822
# e2c6a6d2	09-Oct-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: include RFC6675 IsLost() in pipe calculation Add more accounting while processing SACK data, to keep track of when a packet is deemed lost using the RFC6675 guidance. Together with PRR (RFC6937) this allows a sender to retransmit presumed lost packets faster, and loss recovery to complete earlier. Reviewed By: cc, rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D39299
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
# b352ef58	26-Jul-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Handle <RST,ACK> in SYN-RCVD Patch base stack to correctly handle the RST bit independently of other header flags per TCP RFC. MFC after: 1 week Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D40982
# e5738ee0	11-May-2023	Cheng Cui <cc@FreeBSD.org>	Under RSS, assign a TCP flow's inp_flowid anyway. Summary: This brings some benefit of a tcp flow identification for some kernel modules, such as siftr. Reviewers: rrs, rscheff, tuexen, #transport! Approved by: tuexen (mentor), rrs Subscribers: imp, melifaro, glebius Differential Revision: https://reviews.freebsd.org/D40061
# 35bc0bcc	07-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: reduce argument list to functions that pass a segment The socket argument is superfluous, as a tcpcb always has one and only one socket. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39434
# 78e6c3aa	27-Mar-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: update error counter when dropping a packet due to bad source Use the same counter that ip_input()/ip6_input() use for bad destination address. For IPv6 this is already heavily abused ip6s_badscope, which needs to be split into several separate error counters. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D39234
# 69c7c811	16-Mar-2023	Randall Stewart <rrs@FreeBSD.org>	Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities. The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints we need to move to using the new inline functions. This adds them and moves rack to now use the tcp_tracepoints. Reviewed by: tuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D38831
# 713264f6	06-Mar-2023	Mark Johnston <markj@FreeBSD.org>	netinet: Tighten checks for unspecified source addresses The assertions added in commit b0ccf53f2455 ("inpcb: Assert against wildcard addrs in in_pcblookup_hash_locked()") revealed that protocol layers may pass the unspecified address to in_pcblookup(). Add some checks to filter out such packets before we attempt an inpcb lookup: - Disallow the use of an unspecified source address in in_pcbladdr() and in6_pcbladdr(). - Disallow IP packets with an unspecified destination address. - Disallow TCP packets with an unspecified source address, and add an assertion to verify the comment claiming that the case of an unspecified destination address is handled by the IP layer. Reported by: syzbot+9ca890fb84e984e82df2@syzkaller.appspotmail.com Reported by: syzbot+ae873c71d3c71d5f41cb@syzkaller.appspotmail.com Reported by: syzbot+e3e689aba1d442905067@syzkaller.appspotmail.com Reviewed by: glebius, melifaro MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38570
# 18b83b62	26-Jan-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: reduce the size of t_rttupdated in tcpcb During tcp session start, various mechanisms need to track a few initial RTTs before becoming active. Prevent overflows of the corresponding tracking counter and reduce the size of tcpcb simultaneously. Reviewed By: #transport, tuexen, guest-ccui Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D21117
# aab8c844	05-Jan-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp/ipfw: fix "ipfw fwd localaddr,port" The ipfw(4) feature of forwarding to a local address without modifying a packet was broken. The first lookup must always be a non-wildcard one, because its goal is to find an already existing socket. Otherwise a local wildcard listener with the same port number may match, resulting in the connection being forwarded to the wrong port. Reported by: Pavel Polyakov <bsd kobyla.org> Fixes: d88eb4654f372d0451139a1dbf525a8f2cad1cf8
# eaabc937	14-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: retire TCPDEBUG This subsystem is superseded by modern debugging facilities, e.g. DTrace probes and TCP black box logging. We intentionally leave SO_DEBUG in place, as many utilities may set it on a socket. Also the tcp::debug DTrace probes look at this flag on a socket. Reviewed by: gnn, tuexen Discussed with: rscheff, rrs, jtl Differential revision: https://reviews.freebsd.org/D37694
# e68b3792	07-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: embed inpcb into tcpcb For the TCP protocol inpcb storage specify allocation size that would provide space to most of the data a TCP connection needs, embedding into struct tcpcb several structures, that previously were allocated separately. The most important one is the inpcb itself. With embedding we can provide strong guarantee that with a valid TCP inpcb the tcpcb is always valid and vice versa. Also we reduce number of allocs/frees per connection. The embedded inpcb is placed in the beginning of the struct tcpcb, since in_pcballoc() requires that. However, later we may want to move it around for cache line efficiency, and this can be done with a little effort. The new intotcpcb() macro is ready for such move. The congestion algorithm data, the TCP timers and osd(9) data are also embedded into tcpcb, and temporary struct tcpcb_mem goes away. There was no extra allocation here, but we went through extra pointer every time we accessed this data. One interesting side effect is that now TCP data is allocated from SMR-protected zone. Potentially this allows the TCP stacks or other TCP related modules to utilize that for their own synchronization. Large part of the change was done with sed script: s/tp->ccv->/tp->t_ccv./g s/tp->ccv/\&tp->t_ccv/g s/tp->cc_algo/tp->t_cc/g s/tp->t_timers->tt_/tp->tt_/g s/CCV$ccv, osd$/\&CCV(ccv, t_osd)/g Dependency side effect is that code that needs to know struct tcpcb should also know struct inpcb, that added several <netinet/in_pcb.h>. Differential revision: https://reviews.freebsd.org/D37127
# bd4f9866	16-Nov-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: remove unused t_rttbest No functional change intended. Reviewed by: rscheff@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D37401
# 9eb0e832	08-Nov-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: provide macros to access inpcb and socket from a tcpcb There should be no functional changes with this commit. Reviewed by: rscheff Differential revision: https://reviews.freebsd.org/D37123
# f71cb9f7	08-Nov-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: inp_socket is valid through the lifetime of a TCP inpcb The inp_socket is cleared only in in_pcbdetach(), which for TCP is always accompanied with in_pcbfree(). An inpcb that went through in_pcbfree() shall never be returned by any kind of pcb lookup. Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D37062
# f567d55f	08-Nov-2022	Gleb Smirnoff <glebius@FreeBSD.org>	inpcb: don't return INP_DROPPED entries from pcb lookups The in_pcbdrop() KPI, which is used solely by TCP, allows to remove a pcb from hash list and mark it as dropped. The comment suggests that such pcb won't be returned by lookups. Indeed, every call to in_pcblookup*() is accompanied by a check for INP_DROPPED. Do what comment suggests: never return such pcbs and remove unnecessary checks. Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D37061
# 004bb636	05-Nov-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Move sysctl OIDs related to ECN to tcp_ecn.c Keep all ECN related code in (mostly) one place. No functional change. Event: IETF 115 Hackathon Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D37285
# b1258b76	06-Nov-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: add conservative d.cep accounting algorithm Accurate ECN asks to conservatively estimate, when the ACE counter may have wrapped due to a single ACK covering a larger number of segments. This is described in Annex A.2 of the accurate-ecn draft. Event: IETF 115 Hackathon Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D37281
# c348e880	31-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: make tcp_handle_wakeup() static and robust It is called only from tcp_input() and always has valid parameter. Reviewed by: rscheff, tuexen Differential revision: https://reviews.freebsd.org/D37115
# eda63345	25-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove useless today lock assertion in a middle of function It was added back in 7cfc6904408b, when there was a jump label above and tcp_input() hadn't been locked all through.
# 83c1ec92	20-Oct-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: ECN preparations for ECN++, AccECN (tcp_respond) tcp_respond is another function to build a tcp control packet quickly. With ECN++ and AccECN, both the IP ECN header, and the TCP ECN flags are supposed to reflect the correct state. Also ensure that on receiving multiple ECN SYN-ACKs, the responses triggered will reflect the latest state. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D36973
# 0d744519	06-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove tcptw, the compressed timewait state structure The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no longer justify the complexity required to maintain it. For longer explanation please check out the email [1]. Surprisingly through almost 20 years the TCP stack functionality of handling the TIME_WAIT state with a normal tcpcb did not bitrot. The existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state, which is confirmed by the packetdrill tcp-testsuite [2]. This change just removes tcptw and leaves INP_TIMEWAIT. The flag will be removed in a separate commit. This makes it easier to review and possibly debug the changes. [1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html [2] https://github.com/freebsd-net/tcp-testsuite Differential revision: https://reviews.freebsd.org/D36398
# cd84e78f	04-Oct-2022	Randall Stewart <rrs@FreeBSD.org>	tcp idle reduce does not work for a server. TCP has an idle-reduce feature that allows a connection to reduce its cwnd after it has been idle more than an RTT. This feature only works for a sending side connection. It does this by checking the idle time (t_rcvtime vs ticks) at output to see if it's more than the RTO timeout. The problem comes if you are a web server. You get a request and then send out all the data, then go idle. The next time you would send is in response to a request from the peer asking for more data. But the thing is you updated t_rcvtime when the request came in so you never reduce. The fix is to do the idle reduce check also on inbound. Reviewed by: tuexen, rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36721
# 08af8aac	27-Sep-2022	Randall Stewart <rrs@FreeBSD.org>	Tcp progress timeout Rack has had the ability to timeout connections that just sit idle automatically. This feature of course is off by default and requires that the user turn it on (though the socket option has been missing in tcp_usrreq.c). Let's get the progress timeout fully supported in the base stack as well as rack. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36716
# d1b07f36	26-Sep-2022	Randall Stewart <rrs@FreeBSD.org>	TCP complete end status work. The ending of a connection can tell us a lot about what happened i.e. did it fail to setup, did it timeout, was it a normal close. Oftentimes this is useful information to help analyze and debug issues. Rack has had end status for some time but the base stack has not. Let's go ahead and add in the missing bits to populate the end status. Reviewed by: tuexen, rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36712
# e5049a17	26-Sep-2022	Randall Stewart <rrs@FreeBSD.org>	TCP rack does not work properly with cubic. Right now if you use rack with cubic (the new default cc) you will have improper results. This is because rack uses different variables than the base stack (or bbr) and thus tcp_compute_pipe() always returns so that cubic will choose a 30% backoff not the 50% backoff it should when it is in newreno compatibility mode. The fix is to allow a stack (rack) to override its own compute_pipe. Reviewed by: tuexen, rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36711
# 5ae83e0d	21-Sep-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: send ACKs when requested When doing Limited Transmit send an ACK when needed by the protocol processing (like sending ACKs with a DSACK block). PR: 264257 PR: 263445 PR: 260393 Reviewed by: rscheff@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D36631
# 493105c2	21-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: fix simultaneous open and refine e80062a2d43 - The soisconnected() call on transition from SYN_RCVD to ESTABLISHED is also necessary for a half-synchronized connection. Fix that just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED. - Provide a comment that explains at what conditions the call to soisconnected() is necessary. - Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN. - Extend the change to the BBR and RACK stacks. Note: the interaction between the accept_filter(9) and the socket layer is not fully consistent, yet. For most accept filters this call to soisconnected() will not move the connection from the incomplete queue to the complete. The move would happen only when the filter has received the desired data, and soisconnected() would be called once again from sorwakeup(). Ideally, we should mark socket as connected only there, and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the simultaneous open case. However, this doesn't yet work. Reviewed by: rscheff, tuexen, rrs Differential revision: https://reviews.freebsd.org/D36641
# e80062a2	08-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: avoid call to soisconnected() on transition to ESTABLISHED This call existed since pre-FreeBSD times, and it is hard to understand why it was there in the first place. After 6f3caa6d815 it definitely became necessary always and commit message from f1ee30ccd60 confirms that. Now that 6f3caa6d815 is effectively backed out by 07285bb4c22, the call appears to be useful only for sockets that landed on the incomplete queue, e.g. sockets that have accept_filter(9) enabled on them. Provide a new TCP flag to mark connections that are known to be on the incomplete queue, and call soisconnected() only for those connections. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36488
# c21b7b55	31-Aug-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: finish SACK loss recovery on sudden lack of SACK blocks While a receiver should continue sending SACK blocks for the duration of a SACK loss recovery, if for some reason the TCP options no longer contain these SACK blocks, but we already started maintaining the Scoreboard, keep on handling incoming ACKs (without SACK) as belonging to the SACK recovery. Reported by: thj Reviewed by: tuexen, #transport MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D36046
# c0060575	29-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove a dead code leftover from T/TCP, that doesn't have any value today.
# d9f6ac88	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: retire PRU_ flags and their char names For many years only TCP debugging used them, but relatively recently TCP DTrace probes also started to use them. Move their declarations into tcp_debug.h, but start including tcp_debug.h unconditionally, so that compilation with DTrace and without TCPDEBUG is possible.
# d88eb465	10-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: address a wire level race with 2 ACKs at the end of TCP handshake Imagine we are in SYN-RCVD state and two ACKs arrive at the same time, both valid, e.g. coming from the same host and with valid sequence. First packet would locate the listening socket in the inpcb database, write-lock it and start expanding the syncache entry into a socket. Meanwhile second packet would wait on the write lock of the listening socket. First packet will create a new ESTABLISHED socket, free the syncache entry and unlock the listening socket. Second packet would call into syncache_expand(), but this time it will fail as there is no syncache entry. Second packet would generate RST, effectively resetting the remote connection. It seems to me, that it is impossible to solve this problem with just rearranging locks, as the race happens at a wire level. To solve the problem, for an ACK packet arrived on a listening socket, that failed syncache lookup, perform a second non-wildcard lookup right away. That lookup may find the new born socket. Otherwise, we indeed send RST. Tested by: kp Reviewed by: tuexen, rrs PR: 265154 Differential revision: https://reviews.freebsd.org/D36066
# e7231d07	07-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_input: update comment to match reality.
# 74703901	04-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: use a TCP flag to check if connection has been close(2)d The flag SS_NOFDREF is a private flag of the socket layer. It also is supposed to be read with SOCK_LOCK(), which we don't own here. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D35663
# ad3ad064	27-Jun-2022	Gleb Smirnoff <glebius@FreeBSD.org>	blackhole(4): fix operator precedence Fixes: 3ea9a7cf7b09a355cde3a76824809402b99d0892
# f5766992	15-Jun-2022	Hans Petter Selasky <hselasky@FreeBSD.org>	tcp: Correctly compute the TCP goodput in bits per second by using SEQ_SUB(). TCP sequence number differences should be computed using SEQ_SUB(). Differential Revision: https://reviews.freebsd.org/D35505 Reviewed by: rscheff@ MFC after: 1 week Sponsored by: NVIDIA Networking
# 43283184	12-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: use socket buffer mutexes in struct socket directly Since c67f3b8b78e the sockbuf mutexes belong to the containing socket, and socket buffers just point to it. In 74a68313b50 macros that access this mutex directly were added. Go over the core socket code and eliminate code that reaches the mutex by dereferencing the sockbuf compatibility pointer. This change requires a KPI change, as some functions were given the sockbuf pointer only without any hint if it is a receive or send buffer. This change doesn't cover the whole kernel, many protocols still use compatibility pointers internally. However, it allows operation of a protocol that doesn't use them. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35152
# f7220c48	05-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: move ECN handling code to a common file Reduce the burden to maintain correct and extensible ECN related code across multiple stacks and codepaths. Formally no functional change. Incidentally this establishes correct ECN operation in one instance. Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34162
# 7994ef3c	04-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	Revert "tcp: move ECN handling code to a common file" This reverts commit 0c424c90eaa6602e07bca7836b1d178b91f2a88a.
# 0c424c90	04-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: move ECN handling code to a common file Reduce the burden to maintain correct and extensible ECN related code across multiple stacks and codepaths. Formally no functional change. Incidentally this establishes correct ECN operation in one instance. Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34162
# 1ebf4607	03-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Access all 12 TCP header flags via inline function In order to consistently provide access to all (including reserved) TCP header flag bits, use an accessor function tcp_get_flags and tcp_set_flags. Also expand any flag variable from uint8_t / char to uint16_t. Reviewed By: hselasky, tuexen, glebius, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34130
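As a rough illustration of what a 12-bit flag accessor involves: the TCP header keeps the low eight flags in byte 13 and the remaining four bits in the low nibble of byte 12, next to the data offset. The sketch below works on those two raw bytes. The names `get_flags12()` and `set_flags12()` are made up for this example and are not the kernel's `tcp_get_flags`/`tcp_set_flags`.

```c
#include <stdint.h>

/*
 * Hypothetical accessor: combine the low nibble of header byte 12
 * (off_rsvd) with header byte 13 (flags_lo) into one 12-bit value.
 */
static uint16_t
get_flags12(uint8_t off_rsvd, uint8_t flags_lo)
{
	return ((uint16_t)(((off_rsvd & 0x0f) << 8) | flags_lo));
}

/*
 * Hypothetical setter: store a 12-bit flag value without disturbing
 * the data offset held in the high nibble of byte 12.
 */
static void
set_flags12(uint8_t *off_rsvd, uint8_t *flags_lo, uint16_t flags)
{
	*off_rsvd = (uint8_t)((*off_rsvd & 0xf0) | ((flags >> 8) & 0x0f));
	*flags_lo = (uint8_t)(flags & 0xff);
}
```

Holding the result in a `uint16_t` rather than a `uint8_t` or `char` is what keeps the upper four bits from being silently truncated.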
# 4531b345	27-Jan-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Tidying up the conditionals for unwinding a spurious RTO - Use the semantically correct TSTMP_xx macro when comparing timestamps. (No functional change) - check for bad retransmits only when TSopt is present in ACK (don't assume there will be a valid TSopt in the TCP options struct) - exclude tsecr == 0, since that most likely indicates an invalid ts echo return (tsecr) value. Reviewed By: tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34062
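The three conditions listed in the commit above can be sketched as a single predicate. This is only an illustration: `TSTMP_LT` is defined locally as a stand-in, and `echo_predates_rexmit()` with its parameters is a hypothetical name, not the code in tcp_input.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in: wraparound-safe "a is earlier than b" for timestamps. */
#define TSTMP_LT(a, b)	((int32_t)((a) - (b)) < 0)

/*
 * Hypothetical check: an ACK hints at a spurious RTO only if it carries
 * a timestamp option, the echoed value is non-zero, and the echoed value
 * predates the time of the retransmission.
 */
static bool
echo_predates_rexmit(bool ts_present, uint32_t tsecr, uint32_t rexmit_ts)
{
	return (ts_present && tsecr != 0 && TSTMP_LT(tsecr, rexmit_ts));
}
```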
# 68e623c3	27-Jan-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Rewind erroneous RTO only while performing RTO retransmissions Under rare circumstances, a spurious retransmission is incorrectly detected and rewound, messing up various tcpcb values, which can lead to a panic when SACK is in use. Reviewed By: tuexen, chengc_netapp.com, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D33979
# 40fa3e40	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: mechanically substitute call to tfb_tcp_output to new method. Made with sed(1) execution: sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/) sed: s/tp->t_fb->tfb_tcp_output$tp$/tcp_output(tp)/ s/to tfb_tcp_output/to tcp_output()/ Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33366
# 75add59a	17-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: allocate statistics in the main tcp_init() No reason to have a separate SYSINIT.
# db0ac6de	02-Dec-2021	Cy Schubert <cy@FreeBSD.org>	Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816" This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b. A mismerge of a merge to catch up to main resulted in files being committed which should not have been.
# de2d4784	02-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	SMR protection for inpcbs With introduction of epoch(9) synchronization to network stack the inpcb database became protected by the network epoch together with static network data (interfaces, addresses, etc). However, inpcb aren't static in nature, they are created and destroyed all the time, which creates some traffic on the epoch(9) garbage collector. Fairly new feature of uma(9) - Safe Memory Reclamation allows to safely free memory in page-sized batches, with virtually zero overhead compared to uma_zfree(). However, unlike epoch(9), it puts stricter requirement on the access to the protected memory, needing the critical(9) section to access it. Details: - The database is already built on CK lists, thanks to epoch(9). - For write access nothing is changed. - For a lookup in the database SMR section is now required. Once the desired inpcb is found we need to transition from SMR section to r/w lock on the inpcb itself, with a check that inpcb isn't yet freed. This requires some complexity, since SMR section itself is a critical(9) section. The complexity is hidden from KPI users in inp_smr_lock(). - For an inpcb list traversal (a pcblist sysctl, or broadcast notification) also a new KPI is provided, that hides internals of the database - inp_next(struct inp_iterator *). Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33022
# 3ea9a7cf	28-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	blackhole(4): disable for locally originated TCP/UDP packets In most cases blackholing for locally originated packets is undesired, leads to different kind of lags and delays. Provide sysctls to enforce it, e.g. for debugging purposes. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D32718
# 74d7fc87	19-Jun-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Add PRR cwnd reduction for non-SACK loss This completes PRR cwnd reduction in all circumstances for the base TCP stack (SACK loss recovery, ECN window reduction, non-SACK loss recovery), preventing the arriving ACKs to clock out new data at the old, too high rate. This reduces the chance to induce additional losses while recovering from loss (during congested network conditions). For non-SACK loss recovery, each ACK is assumed to have one MSS delivered. In order to prevent ACK-split attacks, only one window worth of ACKs is considered to actually have delivered new data. MFC after: 6 weeks Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29441
# f4bb1869	14-Jun-2021	Mark Johnston <markj@FreeBSD.org>	Consistently use the SOLISTENING() macro Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation
# 4747500d	04-Jun-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: A better fix for the previously attempted fix of the ack-war issue with tcp. So it turns out that my fix before was not correct. It ended with us failing some of the "improved" SYN tests, since we are not in the correct states. With more digging I have figured out the root of the problem is that when we receive a SYN|FIN the reassembly code made it so we create a segq entry to hold the FIN. In the established state where we were not in order this would be correct i.e. a 0 len with a FIN would need to be accepted. But if you are in a front state we need to strip the FIN so we correctly handle the ACK but ignore the FIN. This gets us into the proper states and avoids the previous ack war. I back out some of the previous changes but then add a new change here in tcp_reass() that fixes the root cause of the issue. We still leave the rack panic fixes in place however. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30627
# 8c69d988	27-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: When we have an out-of-order FIN we do want to strip off the FIN bit. The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr). However there was a missing case, i.e. where we get an out-of-order FIN by itself. In such a case we don't want to leave the FIN bit set, otherwise we will do the wrong thing and ack the FIN incorrectly. Instead we need to go through the tcp_reasm() code and that way the FIN will be stripped and all will be well. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30497
# 13c0e198	25-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Fix bugs related to the PUSH bit and rack and an ack war Michael's testing with UDP tunneling found an issue with the push bit, which was only partly fixed in the last commit. The problem is the left edge gets transmitted before the adjustments are done to the send_map, this means that right edge bits must be considered to be added only if the entire RSM is being retransmitted. Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself. After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically what happens is we go into the reassembly code and lose the FIN bit. The trick here is we should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything. That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no timers running. This is because the usrclosed function gets called and the FIN's and such have already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2 timer can speed this up in testing. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30451
# 032bf749	21-May-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	[tcp] Keep socket buffer locked until upcall r367492 would unlock the socket buffer before eventually calling the upcall. This leads to problematic interaction with NFS kernel server/client components (MP threads) accessing the socket buffer with potentially not correctly updated state. Reported by: rmacklem Reviewed By: tuexen, #transport Tested by: rmacklem, otis MFC after: 2 weeks Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29690
# 0471a8c7	10-May-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: SACK Lost Retransmission Detection (LRD) Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1 Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931
# 5d8fd932	06-May-2021	Randall Stewart <rrs@FreeBSD.org>	This brings into sync FreeBSD with the netflix versions of rack and bbr. This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (perfecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents coming soon on rack. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036
# 48be5b97	28-Apr-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: stop spurious rescue retransmissions and potential asserts Reported by: pho@ MFC after: 3 days Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29970
# 1db08fbe	16-Apr-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_input: always request read-locking of PCB for any pure SYN segment. This is further rework of 08d9c920275. Now we carry the knowledge of lock type all the way through tcp_input() and also into tcp_twcheck(). Ideally the rlocking for pure SYNs should propagate all the way into the alternative TCP stacks, but not yet today. This should close a race when socket is bind(2)-ed but not yet listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered recently by pho@.
# 7b5053ce	16-Apr-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_input: remove comments and assertions about tcpbinfo locking They aren't valid since d40c0d47cd2.
# 9e644c23	18-Apr-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: add support for TCP over UDP Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special privileges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469
# d1de2b05	17-Apr-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Rename rfc6675_pipe to sack.revised, and enable by default As full support of RFC6675 is in place, deprecating net.inet.tcp.rfc6675_pipe and enabling by default net.inet.tcp.sack.revised. Reviewed By: #transport, kbowling, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28702
# 8d5719aa	18-Mar-2021	Gleb Smirnoff <glebius@FreeBSD.org>	syncache: simplify syncache_add() KPI to return struct socket pointer directly, not overwriting the listen socket pointer argument. Not a functional change.
# 08d9c920	18-Mar-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets When packet is a SYN packet, we don't need to modify any existing PCB. Normally SYN arrives on a listening socket, we either create a syncache entry or generate syncookie, but we don't modify anything with the listening socket or associated PCB. Thus create a new PCB lookup mode - rlock if listening. This removes the primary contention point under SYN flood - the listening socket PCB. Sidenote: when SYN arrives on a synchronized connection, we still don't need write access to PCB to send a challenge ACK or just to drop. There is only one exclusion - tcptw recycling. However, existing entanglement of tcp_input + stacks doesn't allow to make this change small. Consider this patch as first approach to the problem. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D29576
# 90cca08e	08-Apr-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Prepare PRR to work with NewReno LossRecovery Add proper PRR vnet declarations for consistency. Also add pointer to tcpopt struct to tcp_do_prr_ack, in preparation for it to deal with non-SACK window reduction (after loss). No functional change. MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29440
# b9f803b7	25-Mar-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Use PRR for ECN congestion recovery MFC after: 2 weeks Reviewed By: #transport, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28972
# eb3a59a8	25-Mar-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Refactor PRR code No functional change intended. MFC after: 2 weeks Reviewed By: #transport, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29411
# 0533fab8	25-Mar-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Perform simple fast retransmit when SACK Blocks are missing on SACK session MFC after: 2 weeks Reviewed By: #transport, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28634
# 40f41ece	22-Mar-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: improve handling of SYN segments in SYN-SENT state Ensure that the stack does not generate a DSACK block for user data received on a SYN segment in SYN-SENT state. Reviewed by: rscheff MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D29376 Sponsored by: Netflix, Inc.
# e5313869	05-Mar-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Add prr_out in preparation for PRR/nonSACK and LRD Reviewed By: #transport, kbowling MFC after: 3 days Sponsored By: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D29058
# 4a8f3aad	05-Mar-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: remove incorrect reset of SACK variable in PRR Reviewed By: #transport, rrs, tuexen PR: 253848 MFC after: 3 days Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29083
# bb4a7d94	04-Mar-2021	Kristof Provost <kp@FreeBSD.org>	net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros Introduce convenience macros to retrieve the DSCP, ECN or traffic class bits from an IPv6 header. Use them where appropriate. Reviewed by: ae (previous version), rscheff, tuexen, rgrimes MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29056
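For readers unfamiliar with the field layout behind such macros: the first 32 bits of an IPv6 header hold the version (4 bits), the traffic class (8 bits) and the flow label (20 bits), and the traffic class splits into a 6-bit DSCP and a 2-bit ECN field. The sketch below extracts them from that word in host byte order. The function names are hypothetical, and this sketch shifts the DSCP down to a 6-bit code point, which may differ from how the kernel macros present it.

```c
#include <stdint.h>

/* vtf: first 32 bits of the IPv6 header, already in host byte order. */
static uint8_t
ipv6_traffic_class(uint32_t vtf)
{
	return ((uint8_t)((vtf >> 20) & 0xff));
}

/* Upper six bits of the traffic class, shifted down to a code point. */
static uint8_t
ipv6_dscp(uint32_t vtf)
{
	return ((uint8_t)(ipv6_traffic_class(vtf) >> 2));
}

/* Lower two bits of the traffic class. */
static uint8_t
ipv6_ecn(uint32_t vtf)
{
	return ((uint8_t)(ipv6_traffic_class(vtf) & 0x03));
}
```

On a real header the word would first be converted from network byte order, for example with `ntohl()`.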
# 0b0f8b35	01-Mar-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	calculate prr_out correctly when pipe < ssthresh Reviewed By: #transport, tuexen MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28998
# e9071000	28-Feb-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	Improve PRR initial transmission timing Reviewed By: tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28953
# 9e83a6a5	26-Feb-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	Include new data sent in PRR calculation Reviewed By: #transport, kbowling MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28941
# 2593f858	25-Feb-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	A TCP server has to take into consideration, if TCP_NOOPT is preventing the negotiation of TCP features. This affects most TCP options but adherence to RFC7323 with the timestamp option will prevent a session from getting established. PR: 253576 Reviewed By: tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28652
# 31d7a27c	25-Feb-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	PRR: Avoid accounting left-edge twice in partial ACK. Reviewed By: #transport, kbowling MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28819
# 48396dc7	25-Feb-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	Address two incorrect calculations and enhance readability of PRR code - address second instance of cwnd potentially becoming zero - fix subtle bug due to implicit int to uint typecast in max() - fix bug due to typo in hand-coded CEILING() function by using howmany() macro - use int instead of long, and add a missing long typecast - replace if conditionals with easier to read imax/imin (as in pseudocode) Reviewed By: #transport, kbowling MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28813
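Two of the pitfalls named in the commit above are easy to reproduce in isolation. The sketch below uses local stand-ins (`howmany`, `umax_sketch`, `imax_sketch`) so it compiles on its own; it illustrates the class of bug and is not the PRR code itself.

```c
/* Local stand-in for a ceiling-division macro. */
#define howmany(x, y)	(((x) + ((y) - 1)) / (y))

/* A max() that takes unsigned arguments: negative inputs get converted. */
static unsigned int
umax_sketch(unsigned int a, unsigned int b)
{
	return (a > b ? a : b);
}

/* A max() that takes signed arguments: negative inputs stay negative. */
static int
imax_sketch(int a, int b)
{
	return (a > b ? a : b);
}
```

Passing a negative `int` to the unsigned variant converts it to a very large value, so a clamp like "at least 1" returns that huge number. The signed variant returns 1 as intended.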
# a8e431e1	20-Feb-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	PRR: use accurate rfc6675_pipe when enabled Reviewed By: #transport, tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28816
# 853fd7a2	19-Feb-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	Ensure cwnd doesn't shrink to zero with PRR Under some circumstances, PRR may end up with a fully collapsed cwnd when finalizing the loss recovery. Reviewed By: #transport, kbowling Reported by: Liang Tian MFC after: 1 week Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28780
# 3c40e1d5	15-Feb-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	update the SACK loss recovery to RFC6675, with the following new features: - improved pipe calculation which does not degrade under heavy loss - engaging in Loss Recovery earlier under adverse conditions - Rescue Retransmission in case some of the trailing packets of a request got lost All above changes are toggled with the sysctl "rfc6675_pipe" (disabled by default). Reviewers: #transport, tuexen, lstewart, slavash, jtl, hselasky, kib, rgrimes, chengc_netapp.com, thj, #manpages, kbowling, #netapp, rscheff Reviewed By: #transport Subscribers: imp, melifaro MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D18985
# 8268d82c	15-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove per-packet ifa refcounting from IPv6 fast path. Currently ip6_input() calls in6ifa_ifwithaddr() for every local packet, in order to check if the target ip belongs to the local ifa in proper state and increase its counters. in6ifa_ifwithaddr() references found ifa. With epoch changes, both `ip6_input()` and all other current callers of `in6ifa_ifwithaddr()` do not need this reference anymore, as epoch provides stability guarantee. Given that, update `in6ifa_ifwithaddr()` to allow it to return ifa without referencing it, while preserving option for getting referenced ifa if so desired. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28648
# 6a376af0	26-Jan-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	TCP PRR: Patch div/0 in tcp_prr_partialack With clearing of recover_fs in bc7ee8e5bc555, div/0 was observed while processing partial_acks. Suspect that rewind of an erroneous RTO may be causing this - with the above change, recover_fs would no longer be retained at the last calculated value, and reset. But CC_RTO_ERR can reenable IN_RECOVERY(), without setting this again. Adding a safety net prior to the division in that function, which I missed in D28114.
# 84761f3d	26-Jan-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	Adjust line length in tcp_prr_partialack Summary: Wrap lines before column 80 in new prr code checked in recently. No functional changes. Reviewers: tuexen, rrs, jtl, mm, kbowling, #transport Reviewed By: tuexen, mm, #transport Subscribers: imp, melifaro Differential Revision: https://reviews.freebsd.org/D28329
# bc7ee8e5	19-Jan-2021	Richard Scheffenegger <srichard@netapp.com>	Address panic with PRR due to missed initialization of recover_fs Summary: When using the base stack in conjunction with RACK, it appears that infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh. This leaves the recover flightsize (sackhint.recover_fs) uninitialized, leading to a div/0 panic. Address this by properly initializing the variable just prior to first use, if it is not properly initialized. In order to prevent stale information from a prior recovery to negatively impact the PRR calculations in this event, also clear recover_fs once loss recovery is finished. Finally, improve the readability of the initialization of recover_fs when t_dupacks == tcprexmtthresh by adjusting the indentation and using the max(1, snd_nxt - snd_una) macro. Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages Reviewed By: rrs, kbowling, #transport Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro Differential Revision: https://reviews.freebsd.org/D28114
# d2b3cedd	13-Jan-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: add sysctl to tolerate TCP segments missing timestamps When timestamp support has been negotiated, TCP segments received without a timestamp should be discarded. However, there are broken TCP implementations (for example, stacks used by Omniswitch 63xx and 64xx models), which send TCP segments without timestamps although they negotiated timestamp support. This patch adds a sysctl variable which tolerates such TCP segments and allows to interoperate with broken stacks. Reviewed by: jtl@, rscheff@ Differential Revision: https://reviews.freebsd.org/D28142 Sponsored by: Netflix, Inc. PR: 252449 MFC after: 1 week
# cc3c3485	13-Jan-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: fix handling of TCP RST segments missing timestamps A TCP RST segment should be processed even it is missing TCP timestamps. Reported by: dmgk@, kevans@ Reviewed by: rscheff@, dmgk@ Sponsored by: Netflix, Inc. MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28143
# 0e1d7c25	04-Dec-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Add TCP feature Proportional Rate Reduction (PRR) - RFC6937 PRR improves loss recovery and avoids RTOs in a wide range of scenarios (ACK thinning) over regular SACK loss recovery. PRR is disabled by default, enable by net.inet.tcp.do_prr = 1. Performance may be impeded by token bucket rate policers at the bottleneck, where net.inet.tcp.do_prr_conservate = 1 should be enabled in addition. Submitted by: Aris Angelogiannopoulos Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D18892
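The per-ACK send quota that RFC 6937 describes can be sketched as below. This follows the RFC's pseudocode, not the FreeBSD implementation. The struct, the function name and the `conservative` flag are hypothetical; the flag selects the RFC's conservative reduction bound in place of the slow-start reduction bound, and the lower clamp on `recover_fs` is an added safety net against division by zero.

```c
#include <stdint.h>

/* Hypothetical per-connection PRR state, all values in bytes. */
struct prr_state {
	int64_t ssthresh;	/* target window after the reduction */
	int64_t recover_fs;	/* flight size when recovery started */
	int64_t prr_delivered;	/* bytes delivered during recovery */
	int64_t prr_out;	/* bytes sent during recovery */
};

/* Bytes that may be sent in response to one ACK during recovery. */
static int64_t
prr_sndcnt(struct prr_state *s, int64_t delivered, int64_t pipe,
    int64_t mss, int conservative)
{
	int64_t limit, sndcnt;

	s->prr_delivered += delivered;
	if (s->recover_fs < 1)
		s->recover_fs = 1;	/* never divide by zero */
	if (pipe > s->ssthresh) {
		/* Proportional part: ceil(delivered * ssthresh / RecoverFS). */
		sndcnt = (s->prr_delivered * s->ssthresh +
		    s->recover_fs - 1) / s->recover_fs - s->prr_out;
	} else {
		limit = s->prr_delivered - s->prr_out;
		if (!conservative) {
			if (delivered > limit)
				limit = delivered;
			limit += mss;
		}
		sndcnt = s->ssthresh - pipe;
		if (limit < sndcnt)
			sndcnt = limit;
	}
	return (sndcnt > 0 ? sndcnt : 0);
}
```

The caller is expected to add whatever it actually transmits to `prr_out` after each call.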
# 75fcd27a	23-Nov-2020	Michael Tuexen <tuexen@FreeBSD.org>	Fix two occurrences of a typo in a comment introduced in r367530. Reported by: lstewart@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27148
# 283c76c7	09-Nov-2020	Michael Tuexen <tuexen@FreeBSD.org>	RFC 7323 specifies that: * TCP segments without timestamps should be dropped when support for the timestamp option has been negotiated. * TCP segments with timestamps should be processed normally if support for the timestamp option has not been negotiated. This patch enforces the above. PR: 250499 Reviewed by: gnn, rrs MFC after: 1 week Sponsored by: Netflix, Inc Differential Revision: https://reviews.freebsd.org/D27148
# 4d0770f1	08-Nov-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Prevent premature SACK block transmission during loss recovery Under specific conditions, a window update can be sent with outdated SACK information. Some clients react to this by subsequently delaying loss recovery, making TCP perform very poorly. Reported by: chengc_netapp.com Reviewed by: rrs, jtl MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D24237
# 39a12f01	24-Oct-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: move cwnd and ssthresh updates into cc modules This will pave the way of setting ssthresh differently in TCP CUBIC, according to RFC8312 section 4.7. No functional change, only code movement. Submitted by: chengc_netapp.com Reviewed by: rrs, tuexen, rscheff MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26807
# 662c1305	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	net: clean up empty lines in .c and .h files
# 1951fa79	25-Aug-2020	Michael Tuexen <tuexen@FreeBSD.org>	RFC 3465 defines a limit L used in TCP slow start for limiting the number of acked bytes as described in Section 2.2 of that document. This patch ensures that this limit is not also applied in congestion avoidance. Applying this limit also in congestion avoidance can result in using less bandwidth than allowed. Reported by: l.tian.email@gmail.com Reviewed by: rrs, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D26120
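The distinction drawn in the commit above can be sketched as follows: the RFC 3465 limit L caps how many acknowledged bytes count toward window growth, and only in slow start. The function and parameter names are hypothetical, and the sketch leaves out how the counted bytes are then turned into a window increase.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of newly acknowledged bytes to credit
 * toward congestion window growth.  The cap of limit_l * smss applies
 * in slow start only; congestion avoidance credits every byte.
 */
static uint32_t
abc_bytes_to_count(uint32_t bytes_acked, uint32_t smss, uint32_t limit_l,
    bool in_slow_start)
{
	uint32_t cap = limit_l * smss;

	if (in_slow_start && bytes_acked > cap)
		return (cap);
	return (bytes_acked);
}
```

Applying the cap in congestion avoidance as well would discard acknowledged bytes from stretch ACKs and grow the window more slowly than allowed, which is the under-use of bandwidth the commit describes.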
# f359d6eb	13-Aug-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Improve SACK support code for RFC6675 and PRR Adding proper accounting of sacked_bytes and (per-ACK) delivered data to the SACK scoreboard. This will allow more aspects of RFC6675 to be implemented as well as Proportional Rate Reduction (RFC6937). Prior to this change, the pipe calculation controlled with net.inet.tcp.rfc6675_pipe was also susceptible to incorrect results when more than 3 (or 4) holes in the sequence space were present, which can no longer all fit into a single ACK's SACK option. Reviewed by: kbowling, rgrimes (mentor) Approved by: rgrimes (mentor, blanket) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D18624
# e854dd38	08-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	An important statistic in determining if a server process (or client) is being delayed is to know the time to first byte in and time to first byte out. Currently we have no way to know these; all we have is t_starttime. That (t_starttime) tells us what time the 3 way handshake completed. We don't know when the first request came in or how quickly we responded. Nor from a client perspective do we know how long from when we sent out the first byte before the server responded. This small change adds the ability to track the TTFB's. This will show up in BB logging which then can be pulled for later analysis. Note that currently the tracking is via the ticks variable of all three variables. This provides a very rough estimate (hz=1000 it's 1ms). A follow-on set of work will be to change all three of these values into something with a much finer resolution (either microseconds or nanoseconds), though we may want to make the resolution configurable so that on lower powered machines we could still use the much cheaper ticks variable. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24902
# f1ea4e41	03-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	This fixes a couple of syzkaller crashes. Most of them have to do with TFO. Even the default stack had one of the issues: 1) We need to make sure for rack that we don't advance snd_nxt beyond iss when we are not doing fast open. We otherwise can get a bunch of SYN's sent out incorrectly with the seq number advancing. 2) When we complete the 3-way handshake we should not ever append to reassembly if the tlen is 0, if TFO is enabled prior to this fix we could still call the reassembly. Note this affects all three stacks. 3) Rack like its cousin BBR should track if a SYN is on a send map entry. 4) Both bbr and rack need to only consider len incremented on a SYN if the starting seq is iss, otherwise we don't increment len which may mean we return without adding a sendmap entry. This work was done in collaboration with Michael Tuexen, thanks for all the testing! Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D25000
# af2fb894	21-May-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	With RFC3168 ECN, CWR SHOULD only be sent with new data Overly conservative data receivers may ignore the CWR flag on other packets, and keep ECE latched. This can result in continuous reduction of the congestion window, and very poor performance when ECN is enabled. Reviewed by: rgrimes (mentor), rrs Approved by: rgrimes (mentor), tuexen (mentor) MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23364
# 8e051165	21-May-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Retain only mutually supported TCP options after simultaneous SYN When receiving a parallel SYN in SYN-SENT state, remove all the options only we supported locally before sending the SYN,ACK. This addresses a consistency issue on parallel opens. Also, on such a parallel open, the stack could be coaxed into running with timestamps enabled, even if administratively disabled. Reviewed by: tuexen (mentor) Approved by: tuexen (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23371
# 6e16d877	21-May-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Handle ECN handshake in simultaneous open While testing simultaneous open TCP with ECN, found that negotiation fails to arrive at the expected final state. Reviewed by: tuexen (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23373
# b2ade6b1	29-Apr-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Correctly set up the initial TCP congestion window in all cases, by not including the SYN bit sequence space in cwnd related calculations. snd_una is adjusted explicitly in all cases, outside the cwnd update, instead. This fixes an off-by-one conformance issue with regular TCP sessions not using Appropriate Byte Counting (RFC3465), sending one more packet during the initial window than expected. PR: 235256 Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D19000
# bb410f9f	21-Apr-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	revert rS360143 - Correctly set up initial cwnd due to syzkaller panics found Reported by: tuexen Approved by: tuexen (mentor) Sponsored by: NetApp, Inc.
# 73b76966	21-Apr-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Correctly set up the initial TCP congestion window in all cases, by adjusting snd_una right after the connection initialization, to include the one byte in sequence space occupied by the SYN bit. This does not change the regular ACK processing, while making the BYTES_THIS_ACK macro work properly. PR: 235256 Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D19000
# 7ca6e296	12-Mar-2020	Michael Tuexen <tuexen@FreeBSD.org>	Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that instead of TCPSTAT_ADD. Reviewed by: jtl@, rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D23904
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT. Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# a3574665	13-Feb-2020	Michael Tuexen <tuexen@FreeBSD.org>	sack_newdata and snd_recover hold the same value. Therefore, use only a single instance: use snd_recover also where sack_newdata was used. Submitted by: Richard Scheffenegger Differential Revision: https://reviews.freebsd.org/D18811
# 481be5de	12-Feb-2020	Randall Stewart <rrs@FreeBSD.org>	White space cleanup -- remove trailing tabs or spaces from any line. Sponsored by: Netflix Inc.
# 9cc711c9	25-Jan-2020	Michael Tuexen <tuexen@FreeBSD.org>	Sending CWR after an RTO is according to RFC 3168 generally required and not only for the DCTCP congestion control. Submitted by: Richard Scheffenegger Reviewed by: rgrimes, tuexen@, Cheng Cui MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23119
# a2d59694	25-Jan-2020	Michael Tuexen <tuexen@FreeBSD.org>	As a TCP client only enable ECN when the corresponding sysctl variable indicates that ECN should be negotiated for the client side. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23228
# ee97681e	24-Jan-2020	Michael Tuexen <tuexen@FreeBSD.org>	Don't delay the ACK for a TCP segment with the CWR flag set. This allows the data sender to increase the CWND faster. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@, Cheng Cui MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22670
# 8f63a52b	24-Jan-2020	Michael Tuexen <tuexen@FreeBSD.org>	The server side of TCP fast open relies on the delayed ACK timer to allow including user data in the SYN-ACK. When DSACK support was added in r347382, an immediate ACK was sent even for the received SYN with user data. This patch fixes that and allows again to send user data with the SYN-ACK. Reported by: Jeremy Harris Reviewed by: Richard Scheffenegger, rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D23212
# 334fc582	08-Jan-2020	Bjoern A. Zeeb <bz@FreeBSD.org>	vnet: virtualise more network stack sysctls. Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointless at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks
# 4ad24737	06-Jan-2020	Randall Stewart <rrs@FreeBSD.org>	This catches rack up in the recent changes to ECN and also commonizes the functions that both the FreeBSD and RACK stacks use. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D23052
# e11c9783	31-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	Fix delayed ACK generation for DCTCP. Submitted by: Richard Scheffenegger Reviewed by: chengc@netapp.com, rgrimes@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22644
# 83a2839f	31-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	Clear the flag indicating that the last received packet was marked CE also in the case where a packet not marked was received. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19143
# adc56f5a	02-Dec-2019	Edward Tomasz Napierala <trasz@FreeBSD.org>	Make use of the stats(3) framework in the TCP stack. This makes it possible to retrieve per-connection statistical information such as the receive window size, RTT, or goodput, using a newly added TCP_STATS getsockopt(3) option, and extract them using the stats_voistat_fetch(3) API. See the net/tcprtt port for an example consumer of this API. Compared to the existing TCP_INFO system, the main differences are that this mechanism is easy to extend without breaking ABI, and provides statistical information instead of raw "snapshots" of values at a given point in time. stats(3) is more generic and can be used in both userland and the kernel. Reviewed by: thj Tested by: thj Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20655
# 3cf38784	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	Move all ECN related flags from the flags to the flags2 field. This allows adding more ECN related flags in the future. No functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22497
# b72e56e7	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	This is an initial step in implementing the new congestion window validation as specified in RFC 7661. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D21798
# fa49a964	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	In order for the TCP Handshake to support ECN++, and further ECN-related improvements, the ECN bits need to be exposed to the TCP SYNcache. This change is a minimal modification to the function headers, without any functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22436
# a4adf6cc	30-Nov-2019	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros. r354748-354750 replaced the KAME macros with m_pulldown() calls. Contrary to the rest of the network stack m_len checks before m_pulldown() were not put in place (see r354748). Put these m_len checks in place for now (to go along with the style of the network stack since the initial commits). These are not put in for performance but to avoid an error scenario (even though it also will help performance at the moment as it avoids allocating an extra mbuf; not because of the unconditional function call). The observed error case went like this: (1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it. (2) m_pullup() will call m_get() unless the requested length is larger than MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the requested length of data and pkthdr into the new mbuf. (3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail. This was observed with failing auto-configuration as an RA packet of 200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input() dropped the mbuf. (Re-)adding the m_len checks before m_pullup() calls avoids these problems with mbufs using external storage for now. MFC after: 3 weeks Sponsored by: Netflix
# 4e619b17	15-Nov-2019	Bjoern A. Zeeb <bz@FreeBSD.org>	IP6_EXTHDR_CHECK(): remove the last instances While r354748 removed almost all IP6_EXTHDR_CHECK() calls, these are not part of the PULLDOWN_TESTS. Equally convert these IP6_EXTHDR_CHECK()s here to m_pullup() and remove the extra check and m_pullup() in tcp_input() under isipv6 given tcp6_input() has done exactly that pullup already. MFC after: 8 weeks Sponsored by: Netflix
# a8fe77d8	12-Nov-2019	Bjoern A. Zeeb <bz@FreeBSD.org>	netinet: update mp to pass the proper value back In ip6_[direct_]input() we are looping over the extension headers to deal with the next header. We pass a pointer to an mbuf pointer to the handling functions. In certain cases the mbuf can be updated there and we need to pass the new one back. That was missing in dest6_input() and route6_input(). In tcp6_input() we should also update it before we call tcp_input(). In addition to that mark the mbuf NULL all the times when we return that we are done with handling the packet and no next header should be checked (IPPROTO_DONE). This will eventually allow us to assert proper behaviour and catch the above kind of errors more easily, expecting *mp to always be set. This change is extracted from a larger patch and not an exhaustive change across the entire stack yet. PR: 240135 Reported by: prabhakar.lakhera gmail.com MFC after: 3 weeks Sponsored by: Netflix
# d40c0d47	07-Nov-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Now that all of the tcp_input() and all its branches are executed in the network epoch, we can greatly simplify synchronization. Remove all unnecessary epoch enters hidden under INP_INFO_RLOCK macro. Remove some unnecessary assertions and convert necessary ones into the NET_EPOCH_ASSERT macro.
# 503f4e47	07-Nov-2019	Bjoern A. Zeeb <bz@FreeBSD.org>	netinet*: variable cleanup In preparation for another change factor out various variable cleanups. These mainly include: (1) do not assign values to variables during declaration: this makes the code more readable and does allow for better grouping of variable declarations, (2) do not assign values to variables before need; e.g., if a variable is only used in the 2nd half of a function and we have multiple return paths before that, then do not set it before it is needed, and (3) try to avoid assigning the same value multiple times. MFC after: 3 weeks Sponsored by: Netflix
# 6d261981	09-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	Only update SACK/DSACK lists when a non-empty segment was received. This fixes hitting a KASSERT with a valid packet exchange. Reviewed by: rrs@, Richard Scheffenegger MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21567
# ecc5b1d1	03-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	Fix the SACK block generation in the base TCP stack by bringing it in sync with the RACK stack. Reviewed by: rrs@ MFC after: 5 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21513
# fe5dee73	02-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	This patch improves the DSACK handling to conform with RFC 2883. The lowest SACK block is used when multiple blocks would be eligible as DSACK blocks. SACK blocks get reordered, while the ordering of SACK blocks not relevant in the DSACK context is maintained. Reviewed by: rrs@, tuexen@ Obtained from: Richard Scheffenegger MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21038
# c4556b2f	12-Aug-2019	Andrey V. Elsukov <ae@FreeBSD.org>	Save ip_ttl value and restore it after checksum calculation. Since ipovly is used for checksum calculation, part of the original IP header is zeroed. This part includes the ip_ttl field, which can be used later in IP_MINTTL socket option handling. PR: 239799 MFC after: 1 week
# b5a154d8	09-May-2019	Michael Tuexen <tuexen@FreeBSD.org>	Don't use C++ style comments. These were introduced in r347382. Reported by: ngie@
# 5acfd95c	09-May-2019	Michael Tuexen <tuexen@FreeBSD.org>	Receiver side DSACK implementation. This adds initial support for RFC 2883. Submitted by: Richard Scheffenegger Reviewed by: rrs@ Differential Revision: https://reviews.freebsd.org/D19334
# 560c0586	21-Feb-2019	Michael Tuexen <tuexen@FreeBSD.org>	The receive buffer autoscaling for TCP is based on a linear growth, which is acceptable in the congestion avoidance phase, but not during slow start. The MTU is also not taken into account. Use a method instead, which is based on exponential growth working also in slow start and being independent from the MTU. This is joint work with rrs@. Reviewed by: rrs@, Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18375
# 116ef4d6	31-Jan-2019	Michael Tuexen <tuexen@FreeBSD.org>	When handling SYN-ACK segments in the SYN-RCVD state, set tp->snd_wnd consistently. This inconsistency was observed when working on the bug reported in PR 235256, although it does not fix the reported issue. The fix for the PR will be a separate commit. PR: 235256 Reviewed by: rrs@, Richard Scheffenegger MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19033
# bf7fcdb1	27-Jan-2019	Michael Tuexen <tuexen@FreeBSD.org>	Fix the detection of ECN-setup SYN-ACK packets. RFC 3168 defines an ECN-setup SYN-ACK packet as one with the ECE flag set and the CWR flag not set. The code was only checking if the ECE flag is set. This patch adds the check to verify that the CWR flag is not set. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18996
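The check described in the bf7fcdb1 entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from tcp_input.c: the helper name is_ecn_setup_synack() is hypothetical, and the flag bits are defined locally so the fragment stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits, defined locally for this sketch. */
#define TH_SYN	0x02
#define TH_ACK	0x10
#define TH_ECE	0x40
#define TH_CWR	0x80

/*
 * Hypothetical helper: RFC 3168 defines an ECN-setup SYN-ACK as a
 * SYN-ACK with ECE set and CWR not set.  Testing ECE alone would also
 * accept a SYN-ACK that carries both ECE and CWR.
 */
static bool
is_ecn_setup_synack(uint8_t thflags)
{
	if ((thflags & (TH_SYN | TH_ACK)) != (TH_SYN | TH_ACK))
		return (false);
	return ((thflags & (TH_ECE | TH_CWR)) == TH_ECE);
}
```

Masking both bits and comparing against TH_ECE covers the "ECE set" and "CWR clear" conditions in a single comparison.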
# 7dc90a1d	25-Jan-2019	Michael Tuexen <tuexen@FreeBSD.org>	Fix a bug in the restart window computation of TCP New Reno When implementing support for IW10, an update in the computation of the restart window used after an idle phase was missed. To minimize code duplication, implement the logic in tcp_compute_initwnd() and call it. This fixes a bug in NewReno, which was not aware of IW10. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18940
# 93899d10	18-Oct-2018	Michael Tuexen <tuexen@FreeBSD.org>	The handling of RST segments in the SYN-RCVD state exists in two code paths. The two are not consistent, and the one in the syn cache code does not conform to the relevant specifications (Page 69 of RFC 793 and Section 4.2 of RFC 5961). This patch fixes this: * The sequence number checks are fixed as specified on Page 69 of RFC 793. * The sysctl variable net.inet.tcp.insecure_rst is now honoured and the behaviour is as specified in Section 4.2 of RFC 5961. Approved by: re (gjb@) Reviewed by: bz@, glebius@, rrs@, Differential Revision: https://reviews.freebsd.org/D17595 Sponsored by: Netflix, Inc.
# 384a5c3c	01-Oct-2018	Andrey V. Elsukov <ae@FreeBSD.org>	Add INP_INFO_WUNLOCK_ASSERT() macro and use it instead of INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic it is possible that the code is running in a net_epoch_preempt section, and INP_INFO_UNLOCK_ASSERT() is a very strict assertion for such a case. PR: 231428 Reviewed by: mmacy, tuexen Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17335
# 5dff1c38	21-Aug-2018	Michael Tuexen <tuexen@FreeBSD.org>	Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP socket resulted in sending fragmented IPV6 packets. This is fixed by reducing the MSS to the appropriate value. In addition, if the socket option is set before the handshake happens, announce this MSS to the peer. This is not strictly required, but done since TCP is conservative. PR: 173444 Reviewed by: bz@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16796
# c28440db	19-Aug-2018	Randall Stewart <rrs@FreeBSD.org>	This change represents a substantial restructure of the way we reassemble inbound tcp segments. The old algorithm just blindly dropped in segments without coalescing. This meant that every segment could take up greater and greater room on the linked list of segments. This of course is now subject to a tighter limit (100) of segments which in a high BDP situation will cause us to be a lot more inefficient as we drop segments beyond 100 entries that we receive. What this restructure does is cause the reassembly buffer to coalesce segments putting an emphasis on the two common cases (which avoid walking the list of segments) i.e. where we add to the back of the queue of segments and where we add to the front. We also have the reassembly buffer supporting a couple of debug options (black box logging as well as counters for code coverage). These are compiled out by default but can be added by uncommenting the defines. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16626
# 8db239dc	30-Jul-2018	Michael Tuexen <tuexen@FreeBSD.org>	Fix some TCP fast open issues. The following issues are fixed: * Whenever a TCP server with TCP fast open enabled, calls accept(), recv(), send(), and close() before the TCP-ACK segment has been received, the TCP connection is just dropped and the reception of the TCP-ACK segment triggers the sending of a TCP-RST segment. * Whenever a TCP server with TCP fast open enabled, calls accept(), recv(), send(), send(), and close() before the TCP-ACK segment has been received, the first byte provided in the second send call is not transferred. * Whenever a TCP client with TCP fast open enabled calls sendto() followed by close() the TCP connection is just dropped. Reviewed by: jtl@, kbowling@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16485
# 6138da62	30-Jul-2018	Michael Tuexen <tuexen@FreeBSD.org>	Add missing send/recv dtrace probes for TCP. These missing probes are mostly in the syncache and timewait code. Reviewed by: markj@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16369
# a026a53a	10-Jul-2018	Michael Tuexen <tuexen@FreeBSD.org>	Use appropriate MSS value when populating the TCP FO client cookie cache When a client receives a SYN-ACK segment with a TCP fast open cookie, but without an MSS option, an MSS value from uninitialised stack memory is used. This patch ensures that in case no MSS option is included in the SYN-ACK, the appropriate value as given in RFC 7413 is used. Reviewed by: kbowling@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16175
# 6573d758	03-Jul-2018	Matt Macy <mmacy@FreeBSD.org>	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
# 9e58ff6f	18-Jun-2018	Matt Macy <mmacy@FreeBSD.org>	convert inpcbinfo hash and info rwlocks to epoch + mutex - Convert inpcbinfo info & hash locks to epoch for read and mutex for write - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as INP_INFO_RLOCK which can no longer fail When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to 3%. Overall packet throughput rate limited by CPU affinity and NIC driver design choices. On the receiver unhalted core cycles samples in in_pcblookup_hash went from 13% to 1.6% Tested by LLNW and pho@ Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15686
# 10d20c84	07-May-2018	Matt Macy <mmacy@FreeBSD.org>	Fix spurious retransmit recovery on low latency networks TCP's smoothed RTT (SRTT) can be much larger than an actual observed RTT. This can be either because of hz restricting the calculable RTT to 10ms in VMs or 1ms using the default 1000hz or simply because SRTT recently incorporated a larger value. If an ACK arrives before the calculated badrxtwin (now + SRTT): tp->t_badrxtwin = ticks + (tp->t_srtt >> (TCP_RTT_SHIFT + 1)); We'll erroneously reset snd_una to snd_max. If multiple segments were dropped and this happens repeatedly the transmit rate will be limited to 1MSS per RTO until we've retransmitted all drops. Reported by: rstone Reviewed by: hiren, transport Approved by: sbruno MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8556
# 2529f56e	22-Mar-2018	Jonathan T. Looney <jtl@FreeBSD.org>	Add the "TCP Blackbox Recorder" which we discussed at the developer summits at BSDCan and BSDCam in 2017. The TCP Blackbox Recorder allows you to capture events on a TCP connection in a ring buffer. It stores metadata with the event. It optionally stores the TCP header associated with an event (if the event is associated with a packet) and also optionally stores information on the sockets. It supports setting a log ID on a TCP connection and using this to correlate multiple connections that share a common log ID. You can log connections in different modes. If you are doing a coordinated test with a particular connection, you may tell the system to put it in mode 4 (continuous dump). Or, if you just want to monitor for errors, you can put it in mode 1 (ring buffer) and dump all the ring buffers associated with the connection ID when we receive an error signal for that connection ID. You can set a default mode that will be applied to a particular ratio of incoming connections. You can also manually set a mode using a socket option. This commit includes only basic probes. rrs@ has added quite an abundance of probes in his TCP development work. He plans to commit those soon. There are user-space programs which we plan to commit as ports. These read the data from the log device and output pcapng files, and then let you analyze the data (and metadata) in the pcapng files. Reviewed by: gnn (previous version) Obtained from: Netflix, Inc. Relnotes: yes Differential Revision: https://reviews.freebsd.org/D11085
# 18a75309	25-Feb-2018	Patrick Kelsey <pkelsey@FreeBSD.org>	Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option. The conditional compilation support is now centralized in tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum theoretical code/data footprint when TCP_RFC7413 is disabled, but nearly all the TFO code should wind up being removed by the optimizer, the additional footprint in the syncache entries is a single pointer, and the additional overhead in the tcpcb is at the end of the structure. This enables the TCP_RFC7413 kernel option by default in amd64 and arm64 GENERIC. Reviewed by: hiren MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14048
# c560df6f	25-Feb-2018	Patrick Kelsey <pkelsey@FreeBSD.org>	This is an implementation of the client side of TCP Fast Open (TFO) [RFC7413]. It also includes a pre-shared key mode of operation in which the server requires the client to be in possession of a shared secret in order to successfully open TFO connections with that server. The names of some existing fastopen sysctls have changed (e.g., net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable). Reviewed by: tuexen MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14047
# b2fe54bc	10-Feb-2018	Andrey V. Elsukov <ae@FreeBSD.org>	Reinitialize IP header length after checksum calculation. It is used later by TCP-MD5 code. This fixes the problem with broken TCP-MD5 over IPv4 when NIC has disabled TCP checksum offloading. PR: 223835 MFC after: 1 week
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 3bdf4c42	11-Oct-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Declare more TCP globals in tcp_var.h, so that alternative TCP stacks can use them. Gather all TCP tunables in tcp_var.h in one place and alphabetically sort them, to ease maintenance of the list. Don't copy and paste declarations in tcp_stacks/fastpath.c.
# 4da9052a	23-Aug-2017	Michael Tuexen <tuexen@FreeBSD.org>	Avoid TCP log messages which are false positives. The check for timestamps is too early to handle SYN-ACK correctly. So move it down after the corresponding processing has been done. PR: 216832 Obtained from: antonfb@hesiod.org MFC after: 1 week
# 43053c12	25-Jul-2017	Sean Bruno <sbruno@FreeBSD.org>	Revert r307901 - Inform CC modules about loss events. This was discussed between various transport@ members and it was requested to be reverted and discussed. Submitted by: Kevin Bowling <kevin.bowling@kev009.com> Reported by: lawrence Reviewed by: hiren Sponsored by: Limelight Networks
# 5d53981a	25-Jul-2017	Sean Bruno <sbruno@FreeBSD.org>	Revert r308180 - Set slow start threshold more accurately on loss ... This was discussed between various transport@ members and it was requested to be reverted and discussed. Submitted by: kevin Reported by: lawrence Reviewed by: hiren
# 98732609	01-Jun-2017	Michael Tuexen <tuexen@FreeBSD.org>	Improve comments to describe what the code does. Reported by: jtl Sponsored by: Netflix, Inc.
# ebfd7534	26-Apr-2017	Michael Tuexen <tuexen@FreeBSD.org>	When a SYN-ACK is received in SYN-SENT state, RFC 793 requires the validation of SEG.ACK as the first step. If the ACK is not acceptable, a RST segment should be sent and the segment should be dropped. Up to now, the segment was partially processed. This patch moves the check for the SEG.ACK validation up to the front as required. Reviewed by: hiren, gnn MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D10424
# 013f4df6	12-Apr-2017	Michael Tuexen <tuexen@FreeBSD.org>	The sysctl variable net.inet.tcp.drop_synfin is not honored in all states, for example not in SYN-SENT. This patch adds code to check the sysctl variable in other states than LISTEN. Thanks to ae and gnn for providing comments. Reviewed by: gnn MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D9894
# e44c1887	10-Apr-2017	Steven Hartland <smh@FreeBSD.org>	Use estimated RTT for receive buffer auto resizing instead of timestamps Switched from using timestamps to RTT estimates when performing TCP receive buffer auto resizing, as not all hosts support / enable TCP timestamps. Disabled reset of receive buffer auto scaling when not in bulk receive mode, which gives an extra 20% performance increase. Also extracted auto resizing to a common method shared between standard and fastpath modules. With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from ~3MB/s to ~100MB/s using the default settings. Reviewed by: lstewart, gnn MFC after: 2 weeks Relnotes: Yes Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D9668
# fbbd9655	28-Feb-2017	Warner Losh <imp@FreeBSD.org>	Renumber copyright clause 4 Renumber clause 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistence on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
# 5ede40dc	11-Feb-2017	Ryan Stone <rstone@FreeBSD.org>	Don't zero out srtt after excess retransmits If the TCP stack has retransmitted more than 1/4 of the total number of retransmits before a connection drop, it decides that its current RTT estimate is hopelessly out of date and decides to recalculate it from scratch starting with the next ACK. Unfortunately, it implements this by zeroing out the current RTT estimate. Drop this hack entirely, as it makes it significantly more difficult to debug connection issues. Instead check for excessive retransmits at the point where srtt is updated from an ACK being received. If we've exceeded 1/4 of the maximum retransmits, discard the previous srtt estimate and replace it with the latest rtt measurement. Differential Revision: https://reviews.freebsd.org/D9519 Reviewed by: gnn Sponsored by: Dell EMC Isilon
# fcf59617	06-Feb-2017	Andrey V. Elsukov <ae@FreeBSD.org>	Merge projects/ipsec into head/. Small summary ------------- o Almost all IPsec related code was moved into sys/netipsec. o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel option IPSEC_SUPPORT added. It enables support for loading and unloading of ipsec.ko and tcpmd5.ko kernel modules. o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type support was removed. Added TCP/UDP checksum handling for inbound packets that were decapsulated by transport mode SAs. setkey(8) modified to show run-time NAT-T configuration of SA. o New network pseudo interface if_ipsec(4) added. For now it is built as part of ipsec.ko module (or with IPSEC kernel). It implements IPsec virtual tunnels to create route-based VPNs. o The network stack now invokes IPsec functions using special methods. The only one header file <netipsec/ipsec_support.h> should be included to declare all the needed things to work with IPsec. o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed. Now these protocols are handled directly via IPsec methods. o TCP_SIGNATURE support was reworked to be more close to RFC. o PF_KEY SADB was reworked: - now all security associations stored in the single SPI namespace, and all SAs MUST have unique SPI. - several hash tables added to speed up lookups in SADB. - SADB now uses rmlock to protect access, and concurrent threads can do SA lookups in the same time. - many PF_KEY message handlers were reworked to reflect changes in SADB. - SADB_UPDATE message was extended to support new PF_KEY headers: SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They can be used by IKE daemon to change SA addresses. o ipsecrequest and secpolicy structures were cardinally changed to avoid locking protection for ipsecrequest. Now we support only limited number (4) of bundled SAs, but they are supported for both INET and INET6. 
o INPCB security policy cache was introduced. Each PCB now caches used security policies to avoid SP lookup for each packet. o For inbound security policies added the mode, when the kernel does check for full history of applied IPsec transforms. o References counting rules for security policies and security associations were changed. The proper SA locking added into xform code. o xform code was also changed. Now it is possible to unregister xforms. tdb_xxx structures were changed and renamed to reflect changes in SADB/SPDB, and changed rules for locking and refcounting. Reviewed by: gnn, wblock Obtained from: Yandex LLC Relnotes: yes Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D9352
# 2b9c9984	03-Jan-2017	George V. Neville-Neil <gnn@FreeBSD.org>	Fix DTrace TCP tracepoints to not use mtod() as it is both unnecessary and dangerous. Those wanting data from an mbuf should use DTrace itself to get the data. PR: 203409 Reviewed by: hiren MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D9035
# 4ddd5aad	02-Dec-2016	Michael Tuexen <tuexen@FreeBSD.org>	Fix the handling of TCP FIN-segments in the CLOSED state When a TCP segment with the FIN bit set was received in the CLOSED state, a TCP RST-ACK-segment is sent. When computing SEG.ACK for this, the FIN counts as one byte. This accounting was missing and is fixed by this patch. Reviewed by: hiren MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://svn.freebsd.org/base/head
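The SEG.ACK accounting described in 4ddd5aad can be sketched as below. This is a minimal illustration, not the code from tcp_input.c: the function name `rst_ack_number` is hypothetical, and the rule that SYN and FIN each occupy one sequence number is taken from the TCP specification rather than from this log.

```c
#include <stdint.h>

#define TH_FIN 0x01	/* FIN flag bit in the TCP header */
#define TH_SYN 0x02	/* SYN flag bit in the TCP header */

/*
 * Hypothetical helper: acknowledgment number for the RST-ACK sent in
 * response to a segment received in the CLOSED state.
 * SEG.ACK = SEG.SEQ + SEG.LEN, where SYN and FIN each count as one byte.
 */
uint32_t
rst_ack_number(uint32_t seg_seq, uint32_t payload_len, uint8_t flags)
{
	uint32_t seg_len = payload_len;

	if (flags & TH_SYN)
		seg_len++;
	if (flags & TH_FIN)
		seg_len++;		/* the accounting the commit adds */
	return (seg_seq + seg_len);	/* wraps modulo 2^32 */
}
```

A bare FIN with sequence number 1000 is therefore acknowledged with 1001, not 1000.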
# 35dfb8cb	19-Nov-2016	Michael Tuexen <tuexen@FreeBSD.org>	Ensure that TCP state changes to state-closing are reported via dtrace. This does not cover state changes from TIME-WAIT. Reviewed by: gnn MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8443
# 6779a1a1	17-Nov-2016	Michael Tuexen <tuexen@FreeBSD.org>	Notify the user by setting errno when a TCP RST segment is received in either the CLOSING or LAST-ACK state. Reviewed by: hiren MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8371
# e04310d5	01-Nov-2016	Hiren Panchasara <hiren@FreeBSD.org>	Set slow start threshold more accurately on loss to be flightsize/2 instead of cwnd/2 as recommended by RFC5681. (spotted by mmacy at nextbsd dot org) Restore pre-r307901 behavior of aligning ssthresh/cwnd on mss boundary. (spotted by slawa at zxy dot spb dot ru) Tested by: dim, Slawa <slawa at zxy dot spb dot ru> MFC after: 1 month X-MFC with: r307901 Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8349
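The slow start threshold rule from e04310d5 can be sketched as follows. The function name `ssthresh_on_loss` is hypothetical. The floor of two segments comes from equation (4) of RFC 5681, and rounding down to a whole number of segments is an assumed reading of "aligning ssthresh/cwnd on mss boundary".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: ssthresh after a loss event, computed as
 * max(FlightSize / 2, 2 * MSS) and rounded down to a multiple of MSS.
 */
uint32_t
ssthresh_on_loss(uint32_t flightsize, uint32_t mss)
{
	assert(mss > 0);

	/* Whole segments that fit in half of the data in flight. */
	uint32_t segs = flightsize / 2 / mss;

	if (segs < 2)
		segs = 2;		/* RFC 5681 lower bound */
	return (segs * mss);
}
```

With 10000 bytes in flight and an MSS of 1460, half the flight is 5000 bytes, which holds three whole segments, so the sketch yields 4380.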
# 4e7f7553	24-Oct-2016	Hiren Panchasara <hiren@FreeBSD.org>	FreeBSD tcp stack used to inform respective congestion control module about the loss event but not use or obey the recommendations i.e. values set by it in some cases. Here is an attempt to solve that confusion by following relevant RFCs/drafts. Stack only sets congestion window/slow start threshold values when there is no CC module available to take that action. All CC modules are inspected and updated when needed to take appropriate action on loss. tcp_stacks/fastpath module has been updated to adapt these changes. Note: Probably, the most significant change would be to not bring congestion window down to 1MSS on a loss signaled by 3-duplicate acks and letting respective CC decide that value. In collaboration with: Matt Macy <mmacy at nextbsd dot org> Discussed on: transport@ mailing list Reviewed by: jtl MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8225
# dd13b7d3	24-Oct-2016	Hiren Panchasara <hiren@FreeBSD.org>	Undo r307899. It needs a bit more work and proper commit log.
# 95d82360	24-Oct-2016	Hiren Panchasara <hiren@FreeBSD.org>	In Collaboration with: Matt Macy <mmacy at nextbsd dot com> Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8225
# f5cf1e5f	18-Oct-2016	Julien Charbon <jch@FreeBSD.org>	Fix a double-free when an inp transitions to INP_TIMEWAIT state after having been dropped. This fix enforces in_pcbdrop() logic in tcp_input(): "in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet delivery or event notification when a socket remains open but TCP has closed." PR: 203175 Reported by: Palle Girgensohn, Slawa Olhovchenkov Tested by: Slawa Olhovchenkov Reviewed by: Slawa Olhovchenkov Approved by: gnn, Slawa Olhovchenkov Differential Revision: https://reviews.freebsd.org/D8211 MFC after: 1 week Sponsored by: Verisign, inc
# 784ce8fa	17-Oct-2016	Hiren Panchasara <hiren@FreeBSD.org>	Make sure tcp_mss() has the same check as tcp_mss_update() to have t_maxseg set to at least 64. This is still just a coverup to avoid kernel panic and not an actual fix. PR: 213232 Reviewed by: glebius MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8272
# 09c305eb	14-Oct-2016	Patrick Kelsey <pkelsey@FreeBSD.org>	Fix cases where the TFO pending counter would leak references, and eventually, memory. Also renamed some tfo labels and added/reworked comments for clarity. Based on an initial patch from jtl. PR: 213424 Reviewed by: jtl MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D8235
# 6d172f58	14-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	The code currently resets the keepalive timer each time a packet is received on a TCP session that has entered the ESTABLISHED state. This results in a lot of calls to reset the keepalive timer. This patch changes the behavior so we set the keepalive timer for the keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will first check to see if the session has been idle for TP_KEEPIDLE ticks. If not, it will reschedule the keepalive timer for the time the session will have been idle for TP_KEEPIDLE ticks. For a session with regular communication, the keepalive timer should fire approximately once every TP_KEEPIDLE ticks. For sessions with irregular communication, the keepalive timer might fire more often. But, the disruption from a periodic keepalive timer should be less than the regular cost of resetting the keepalive timer on every packet. (FWIW, this change saved approximately 1.73% of the busy CPU cycles on a particular test system with a heavy TCP output load. Of course, the actual impact is very specific to the particular hardware and workload.) Reviewed by: gallatin, rrs MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8243
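The rescheduling logic described in 6d172f58 can be sketched as a single decision made when the keepalive timer fires. The function name and the tick-counter parameters are hypothetical; only TP_KEEPIDLE is named in the commit message, and it is represented here by the `keepidle` argument.

```c
#include <stdint.h>

/*
 * Hypothetical helper, called when the keepalive timer fires.
 * Returns 0 if the session has been idle for at least keepidle ticks
 * (time to act on the keepalive), otherwise the number of ticks to wait
 * before the session will have been idle for keepidle ticks.
 */
uint32_t
keepalive_rearm_ticks(uint32_t now, uint32_t last_rcv, uint32_t keepidle)
{
	/* Unsigned subtraction stays correct across tick-counter wraparound. */
	uint32_t idle = now - last_rcv;

	if (idle >= keepidle)
		return (0);
	return (keepidle - idle);
}
```

The per-packet cost drops to recording `last_rcv`; the timer itself is only touched when it fires.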
# 68bd7ed1	12-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	The TFO server-side code contains some changes that are not conditioned on the TCP_RFC7413 kernel option. This change removes those few instructions from the packet processing path. While not strictly necessary, for the sake of consistency, I applied the new IS_FASTOPEN macro to all places in the packet processing path that used the (t_flags & TF_FASTOPEN) check. Reviewed by: hiren Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8219
# 45274760	11-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	Currently, when tcp_input() receives a packet on a session that matches a TCPCB, it checks (so->so_options & SO_ACCEPTCONN) to determine whether or not the socket is a listening socket. However, this causes the code to access a different cacheline. If we first check if the socket is in the LISTEN state, we can avoid accessing so->so_options when processing packets received for ESTABLISHED sessions. If INVARIANTS is defined, the code still needs to access both variables to check that so->so_options is consistent with the state. Reviewed by: gallatin MFC after: 1 week Sponsored by: Netflix
# bd79708d	11-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	In the TCP stack, the hhook(9) framework provides hooks for kernel modules to add actions that run when a TCP frame is sent or received on a TCP session in the ESTABLISHED state. In the base tree, this functionality is only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd, and cc_vegas congestion control modules. Presently, we incur overhead to check for hooks each time a TCP frame is sent or received on an ESTABLISHED TCP session. This change adds a new compile-time option (TCP_HHOOK) to determine whether to include the hhook(9) framework for TCP. To retain backwards compatibility, I added the TCP_HHOOK option to every configuration file that already defined "options INET". (Therefore, this patch introduces no functional change. In order to see a functional difference, you need to compile a custom kernel without the TCP_HHOOK option.) This change will allow users to easily exclude this functionality from their kernel, should they wish to do so. Note that any users who use a custom kernel configuration and use one of the congestion control modules listed above will need to add the TCP_HHOOK option to their kernel configuration. Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8185
# 3ac12506	06-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	Remove "long" variables from the TCP stack (not including the modular congestion control framework). Reviewed by: gnn, lstewart (partial) Sponsored by: Juniper Networks, Netflix Differential Revision: (multiple) Tested by: Limelight, Netflix
# 1d7ee746	29-Sep-2016	Kurt Lidl <lidl@FreeBSD.org>	Properly preserve ip_tos bits for IPv4 packets Restructure code slightly to save ip_tos bits earlier. Fix the bug where the ip_tos field is zeroed out before assigning to the iptos variable. Restore the ip_tos and ip_ver fields only if they have been zeroed during the pseudo-header checksum calculation. Reviewed by: cem, gnn, hiren MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D8053
# 4b7b743c	25-Aug-2016	Lawrence Stewart <lstewart@FreeBSD.org>	Pass the number of segments coalesced by LRO up the stack by repurposing the tso_segsz pkthdr field during RX processing, and use the information in TCP for more correct accounting and as a congestion control input. This is only a start, and an audit of other uses for the data is left as future work. Reviewed by: gallatin, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D7564
# 4c105402	08-Jun-2016	Andrey V. Elsukov <ae@FreeBSD.org>	Clean up unneeded include "opt_ipfw.h". It was used for conditional build IPFIREWALL_FORWARD support. But IPFIREWALL_FORWARD option was removed a long time ago.
# 883054b4	19-May-2016	Don Lewis <truckman@FreeBSD.org>	Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on control to a three way setting. 0 - Totally disable ECN. (no change) 1 - Enable ECN if incoming connections request it. Outgoing connections will request ECN. (no change from present != 0 setting) 2 - Enable ECN if incoming connections request it. Outgoing connections will not request ECN. Change the default value of net.inet.tcp.ecn.enable from 0 to 2. Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have similar capabilities. The actual values above match Linux, and the default matches the current Linux default. Reviewed by: eadler MFC after: 1 month MFH: yes Sponsored by: https://reviews.freebsd.org/D6386
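The three-way net.inet.tcp.ecn.enable setting from 883054b4 reduces to two predicates. The enum and function names below are hypothetical; only the values 0, 1 and 2 and their meanings come from the commit message.

```c
#include <stdbool.h>

/* Hypothetical names for the three sysctl values. */
enum ecn_mode {
	ECN_DISABLED = 0,	/* never negotiate ECN */
	ECN_ACTIVE = 1,		/* request on outgoing, accept on incoming */
	ECN_PASSIVE = 2		/* accept on incoming only (new default) */
};

/* Should an outgoing connection request ECN in its SYN? */
bool
ecn_request_outgoing(int mode)
{
	return (mode == ECN_ACTIVE);
}

/* Should ECN be enabled when an incoming connection requests it? */
bool
ecn_accept_incoming(int mode)
{
	return (mode == ECN_ACTIVE || mode == ECN_PASSIVE);
}
```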
# f59d975e	17-May-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Tiny refactor of r294869/r296881: use defines to mask the VNET() macro. Suggested by: bz
# a4641f4e	03-May-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/net*: minor spelling fixes. No functional change.
# b8c2cd15	21-Apr-2016	Jonathan T. Looney <jtl@FreeBSD.org>	Prevent underflows in tp->snd_wnd if the remote side ACKs more than tp->snd_wnd. This can happen, for example, when the remote side responds to a window probe by ACKing the one byte it contains. Differential Revision: https://reviews.freebsd.org/D5625 Reviewed by: hiren Obtained from: Juniper Networks (earlier version) MFC after: 2 weeks Sponsored by: Juniper Networks
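The underflow that b8c2cd15 guards against comes from subtracting an ACKed byte count from an unsigned send window. A minimal sketch of a saturating update follows; the function name is hypothetical and the exact place where the stack applies the check is not shown in this log.

```c
#include <stdint.h>

/*
 * Hypothetical helper: shrink the send window by the number of bytes
 * just ACKed, clamping at zero instead of wrapping around.
 */
uint32_t
snd_wnd_after_ack(uint32_t snd_wnd, uint32_t acked)
{
	if (acked > snd_wnd)
		return (0);	/* e.g. a window probe byte ACKed at zero window */
	return (snd_wnd - acked);
}
```

Without the clamp, a one-byte ACK against a zero window would produce a window of 0xffffffff.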
# d4d32b9f	16-Mar-2016	Hans Petter Selasky <hselasky@FreeBSD.org>	Fix kernel build after adding new sysctl asserts in r296933.
# bf840a17	14-Mar-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Redo r294869. The array of counters for TCP states doesn't belong to struct tcpstat, because the structure can be zeroed out by netstat(1) -z, and of course running connection counts shouldn't be touched. Place running connection counts into separate array, and provide separate read-only sysctl oid for it.
# 4644fda3	27-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Rename netinet/tcp_cc.h to netinet/cc/cc.h. Discussed with: lstewart
# 2de3e790	21-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	- Rename cc.h to more meaningful tcp_cc.h. - Declare it a kernel only include, which it already is. - Don't include tcp.h implicitly from tcp_cc.h
# b66d74c1	21-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Cleanup TCP files from unnecessary interface related includes.
# 0c39d38d	06-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Historically we have two fields in tcpcb to describe sender MSS: t_maxopd, and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned up after T/TCP removal. After all permutations over the years the result is that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if timestamps are in action, or is equal to t_maxopd otherwise. That's a very rough estimate of MSS reduced by options length. Throughout the code it was used in places, where preciseness was not important, like cwnd or ssthresh calculations. With this change: - t_maxopd goes away. - t_maxseg now stores MSS not adjusted by options. - new function tcp_maxseg() is provided, that calculates MSS reduced by options length. The functions gives a better estimate, since it takes into account SACK state as well. Reviewed by: jtl Differential Revision: https://reviews.freebsd.org/D3593
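The idea behind the tcp_maxseg() function introduced in 0c39d38d, an MSS reduced by the options actually in use, can be sketched as below. This is not the kernel function: the name `effective_maxseg`, its parameters, and the padding arithmetic are assumptions based on the standard option sizes (12 bytes for a padded timestamp option, 2 + 8n bytes for n SACK blocks, 40 bytes of option space at most).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: payload bytes that fit in one segment once the
 * TCP options in use are subtracted from the unadjusted MSS.
 */
unsigned
effective_maxseg(unsigned t_maxseg, bool timestamps, unsigned sack_blocks)
{
	assert(t_maxseg > 40);		/* must exceed the option space */

	unsigned optlen = 0;

	if (timestamps)
		optlen += 12;		/* timestamp option padded to 12 bytes */
	if (sack_blocks > 0) {
		optlen += 2 + 8 * sack_blocks;	/* kind, length, n blocks */
		optlen = (optlen + 3) & ~3u;	/* pad to a 4-byte boundary */
	}
	if (optlen > 40)
		optlen = 40;		/* TCP option space is at most 40 bytes */
	return (t_maxseg - optlen);
}
```

For an MSS of 1460 this gives 1448 with timestamps alone and 1436 with timestamps plus one SACK block, which is why a single fixed adjustment is only a rough estimate.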
# 2d8868db	29-Dec-2015	Jonathan T. Looney <jtl@FreeBSD.org>	When checking the inp_ip_minttl restriction for IPv6 packets, don't check the IPv4 header. CID: 1017920 Differential Revision: https://reviews.freebsd.org/D4727 Reviewed by: bz MFC after: 2 weeks Sponsored by: Juniper Networks
# 281a0fd4	24-Dec-2015	Patrick Kelsey <pkelsey@FreeBSD.org>	Implementation of server-side TCP Fast Open (TFO) [RFC7413]. TFO is disabled by default in the kernel build. See the top comment in sys/netinet/tcp_fastopen.c for implementation particulars. Reviewed by: gnn, jch, stas MFC after: 3 days Sponsored by: Verisign, Inc. Differential Revision: https://reviews.freebsd.org/D4350
# 55bceb1e	15-Dec-2015	Randall Stewart <rrs@FreeBSD.org>	First cut of the modularization of our TCP stack. Still to do is to clean up the timer handling using the async-drain. Other optimizations may be coming to go with this. What's here will allow different TCP implementations (one included). Reviewed by: jtl, hiren, transports Sponsored by: Netflix Inc. Differential Revision: D4055
# 021eaf79	08-Dec-2015	Hiren Panchasara <hiren@FreeBSD.org>	One of the ways to detect loss is to count duplicate acks coming back from the other end till it reaches predetermined threshold which is 3 for us right now. Once that happens, we trigger fast-retransmit to do loss recovery. Main problem with the current implementation is that we don't honor SACK information well to detect whether an incoming ack is a dupack or not. RFC6675 has latest recommendations for that. According to it, dupack is a segment that arrives carrying a SACK block that identifies previously unknown information between snd_una and snd_max even if it carries new data, changes the advertised window, or moves the cumulative acknowledgment point. With the prevalence of Selective ACK (SACK) these days, improper handling can lead to delayed loss recovery. With the fix, new behavior looks like following: 0) th_ack < snd_una --> ignore Old acks are ignored. 1) th_ack == snd_una, !sack_changed --> ignore Acks with SACK enabled but without any new SACK info in them are ignored. 2) th_ack == snd_una, window == old_window --> increment Increment on a good dupack. 3) th_ack == snd_una, window != old_window, sack_changed --> increment When SACK enabled, it's okay to have advertized window changed if the ack has new SACK info. 4) th_ack > snd_una --> reset to 0 Reset to 0 when left edge moves. 5) th_ack > snd_una, sack_changed --> increment Increment if left edge moves but there is new SACK info. Here, sack_changed is the indicator that incoming ack has previously unknown SACK info in it. Note: This fix is not fully compliant to RFC6675. That may require a few changes to current implementation in order to keep per-sackhole dupack counter and change to the way we mark/handle sack holes. PR: 203663 Reviewed by: jtl MFC after: 3 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D4225
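The numbered cases in 021eaf79 can be read as a small decision function. The sketch below is one possible reading, not the code in tcp_input.c: all names are hypothetical, case 5 is checked before case 4 because both match th_ack > snd_una, case 1 is assumed to take precedence over case 2 when SACK is enabled, and the one combination the list does not cover (no SACK, window changed) is treated as "ignore".

```c
#include <stdbool.h>
#include <stdint.h>

enum dupack_action { DUPACK_IGNORE, DUPACK_INCREMENT, DUPACK_RESET };

/*
 * Hypothetical classifier for an incoming ACK, following cases 0-5
 * of the commit message.
 */
enum dupack_action
classify_ack(uint32_t th_ack, uint32_t snd_una, bool sack_enabled,
    bool sack_changed, bool window_changed)
{
	/* Signed difference gives a wraparound-safe sequence comparison. */
	int32_t delta = (int32_t)(th_ack - snd_una);

	if (delta < 0)
		return (DUPACK_IGNORE);			/* case 0: old ACK */
	if (delta > 0)					/* left edge moved */
		return (sack_changed ? DUPACK_INCREMENT	/* case 5 */
		    : DUPACK_RESET);			/* case 4 */

	/* th_ack == snd_una from here on. */
	if (sack_enabled)
		return (sack_changed ? DUPACK_INCREMENT	/* cases 2 and 3 */
		    : DUPACK_IGNORE);			/* case 1 */
	return (window_changed ? DUPACK_IGNORE		/* not in the list */
	    : DUPACK_INCREMENT);			/* case 2 */
}
```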
# 054d38e3	04-Nov-2015	Hiren Panchasara <hiren@FreeBSD.org>	Improve the sysctl node name. X-MFC with: r290122 Sponsored by: Limelight Networks
# 12eeb81f	28-Oct-2015	Hiren Panchasara <hiren@FreeBSD.org>	Calculate the correct amount of bytes that are in-flight for a connection as suggested by RFC 6675. Currently different places in the stack try to guess this in suboptimal ways. The main problem is that current calculations don't take sacked bytes into account. Sacked bytes are the bytes receiver acked via SACK option. This is suboptimal because it assumes that network has more outstanding (unacked) bytes than the actual value and thus sends less data by setting congestion window lower than what's possible which in turn may cause slower recovery from losses. As an example, one of the current calculations looks something like this: snd_nxt - snd_fack + sackhint.sack_bytes_rexmit New proposal from RFC 6675 is: snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit which takes sacked bytes into account which is a new addition to the sackhint struct. Only thing we are missing from RFC 6675 is isLost() i.e. segment being considered lost and thus adjusting pipe based on that which makes this calculation a bit on conservative side. The approach is very simple. We already process each ack with sack info in tcp_sack_doack() and extract sack blocks/holes out of it. We'd now also track this new variable sacked_bytes which keeps track of total sacked bytes reported. One downside to this approach is that we may get incorrect count of sacked_bytes if the other end decides to drop sack info in the ack because of memory pressure or some other reasons. But in this (not very likely) case also the pipe calculation would be conservative which is okay as opposed to being aggressive in sending packets into the network. Next step is to use this more accurate pipe estimation to drive congestion window adjustments. In collaboration with: rrs Reviewed by: jason_eggnet dot com, rrs MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D3971
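The two in-flight estimates quoted in 12eeb81f translate directly into code. The function names below are hypothetical; the variable names mirror the ones in the commit message, passed as plain arguments rather than read from the tcpcb.

```c
#include <stdint.h>

/* One of the older estimates quoted in the commit message. */
uint32_t
pipe_old(uint32_t snd_nxt, uint32_t snd_fack, uint32_t sack_bytes_rexmit)
{
	return (snd_nxt - snd_fack + sack_bytes_rexmit);
}

/*
 * Estimate following RFC 6675 as quoted in the commit message:
 * outstanding bytes, minus bytes the receiver has SACKed, plus
 * bytes retransmitted during SACK recovery.
 */
uint32_t
pipe_rfc6675(uint32_t snd_max, uint32_t snd_una, uint32_t sacked_bytes,
    uint32_t sack_bytes_rexmit)
{
	return (snd_max - snd_una - sacked_bytes + sack_bytes_rexmit);
}
```

For example, with 10000 bytes outstanding, 3000 of them SACKed and 1000 retransmitted, `pipe_rfc6675(10000, 0, 3000, 1000)` gives 8000 bytes in flight.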
# 356c7958	27-Oct-2015	Hiren Panchasara <hiren@FreeBSD.org>	Add sysctl tunable net.inet.tcp.initcwnd_segments to specify initial congestion window in number of segments on fly. It is set to 10 segments by default. Remove net.inet.tcp.experimental.initcwnd10 which is now redundant. Also remove the parent node net.inet.tcp.experimental as it's not needed anymore and also because it was not well thought out. Differential Revision: https://reviews.freebsd.org/D3858 In collaboration with: lstewart Reviewed by: gnn (prev version), rwatson, allanjude, wblock (man page) MFC after: 2 weeks Relnotes: yes Sponsored by: Limelight Networks
# 86a996e6	13-Oct-2015	Hiren Panchasara <hiren@FreeBSD.org>	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren
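The mbuf estimate in 86a996e6 is simple arithmetic: packets saved per direction, times two directions, times the number of sockets, assuming at least one mbuf per saved packet. The function name is hypothetical.

```c
/*
 * Hypothetical helper: lower bound on mbufs held by the packet capture
 * feature, assuming each saved packet occupies at least one mbuf.
 */
unsigned long
tcppcap_min_mbufs(unsigned long sockets, unsigned long pkts_per_direction)
{
	return (sockets * pkts_per_direction * 2);	/* send + receive */
}
```

Five packets per direction across 3,000 sockets gives the 30,000 mbufs quoted in the commit message.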
# 62d4443f	06-Oct-2015	Hiren Panchasara <hiren@FreeBSD.org>	Add a comment specifying how we implement rfc3042. Differential Revision: D3746 MFC after: 1 week Sponsored by: Limelight Networks
# 1558cb24	26-Sep-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Eliminate nd6_nud_hint() and its TCP bindings. Initially function was introduced in r53541 (KAME initial commit) to "provide hints from upper layer protocols that indicate a connection is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability Confirmation). However, it was converted to do nothing (e.g. just return) in r122922 (tcp_hostcache implementation) back in 2003. Some defines were moved to tcp_var.h in r169541. Then, it was broken (for non-corner cases) by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So, right now this code is broken and has no "real" base users. Differential Revision: https://reviews.freebsd.org/D3699
# 5d06879a	13-Sep-2015	George V. Neville-Neil <gnn@FreeBSD.org>	Add DTrace probe points, translators and a corresponding script to provide the TCPDEBUG functionality with pure DTrace. Reviewed by: rwatson MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: D3530
# ff9b006d	02-Aug-2015	Julien Charbon <jch@FreeBSD.org>	Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability: - The existing TCP INP_INFO lock continues to protect the global inpcb list stability during full list traversal (e.g. tcp_pcblist()). - A new INP_LIST lock protects inpcb list actual modifications (inp allocation and free) and inpcb global counters. It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input()) and INP_INFO_WLOCK only in occasional operations that walk all connections. PR: 183659 Differential Revision: https://reviews.freebsd.org/D2599 Reviewed by: jhb, adrian Tested by: adrian, nitroboost-gmail.com Sponsored by: Verisign, Inc.
# 4741bfcb	29-Jul-2015	Patrick Kelsey <pkelsey@FreeBSD.org>	Revert r265338, r271089 and r271123 as those changes do not handle non-inline urgent data and introduce an mbuf exhaustion attack vector similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs. Address the issue described in FreeBSD-SA-15:15.tcp. Reviewed by: glebius Approved by: so Approved by: jmallett (mentor) Security: FreeBSD-SA-15:15.tcp Sponsored by: Norse Corp, Inc.
# d57724fd	17-Jul-2015	Patrick Kelsey <pkelsey@FreeBSD.org>	Check TCP timestamp option flag so that the automatic receive buffer scaling code does not use an uninitialized timestamp echo reply value from the stack when timestamps are not enabled. Differential Revision: https://reviews.freebsd.org/D3060 Reviewed by: hiren Approved by: jmallett (mentor) MFC after: 3 days Sponsored by: Norse Corp, Inc.
# 4c3972f0a	22-Jun-2015	Hiren Panchasara <hiren@FreeBSD.org>	Reverting r284710. Today I learned: iff == if and only if. Suggested by: many
# 26f2eb69	22-Jun-2015	Hiren Panchasara <hiren@FreeBSD.org>	Fix a typo: s/iff/if/ Sponsored by: Limelight Networks
# c52102dd	19-May-2015	Hiren Panchasara <hiren@FreeBSD.org>	Correct the wording as we are increasing the window size. Reviewed by: jhb Sponsored by: Limelight Networks
# 64807b30	12-Jan-2015	Hiren Panchasara <hiren@FreeBSD.org>	DCTCP (Data Center TCP) implementation. DCTCP congestion control algorithm aims to maximise throughput and minimise latency in data center networks by utilising the proportion of Explicit Congestion Notification (ECN) marked packets received from capable hardware as a congestion signal. Highlights: Implemented as a mod_cc(4) module. ECN (Explicit congestion notification) processing is done differently from RFC3168. Takes one-sided DCTCP into consideration where only one of the sides is using DCTCP and other is using standard ECN. IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 Thesis report by Midori Kato: https://eggert.org/students/kato-thesis.pdf Submitted by: Midori Kato <katoon@sfc.wide.ad.jp> and Lars Eggert <lars@netapp.com> with help and modifications from hiren Differential Revision: https://reviews.freebsd.org/D604 Reviewed by: gnn
# 44eb8bbe	11-Dec-2014	Andrey V. Elsukov <ae@FreeBSD.org>	Do not count security policy violation twice. ipsec*_in_reject() do this by their own. Obtained from: Yandex LLC Sponsored by: Yandex LLC
# c2529042	01-Dec-2014	Hans Petter Selasky <hselasky@FreeBSD.org>	Start process of removing the use of the deprecated "M_FLOWID" flag from the FreeBSD network code. The flag is still kept around in the "sys/mbuf.h" header file, but no longer has any users. Instead the "m_pkthdr.rsstype" field in the mbuf structure is now used to decide the meaning of the "m_pkthdr.flowid" field. To modify the "m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX" macros as defined in the "sys/mbuf.h" header file. This patch introduces new behaviour in the transmit direction. Previously network drivers checked if "M_FLOWID" was set in "m_flags" before using the "m_pkthdr.flowid" field. This check has now been replaced by checking if "M_HASHTYPE_GET(m)" is different from "M_HASHTYPE_NONE". In the future more hashtypes will be added, for example hashtypes for hardware dedicated flows. "M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is valid and has no particular type. This change removes the need for an "if" statement in TCP transmit code checking for the presence of a valid flowid value. The "if" statement mentioned above is now a direct variable assignment which is then later checked by the respective network drivers like before. Additional notes: - The SCTP code changes will be committed as a separate patch. - Removal of the "M_FLOWID" flag will also be done separately. - The FreeBSD version has been bumped. MFC after: 1 month Sponsored by: Mellanox Technologies
# 651e4e6a	30-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Merge from projects/sendfile: extend protocols API to support sending not ready data: o Add new flag to pru_send() flags - PRUS_NOTREADY. o Add new protocol method pru_ready(). Sponsored by: Nginx, Inc. Sponsored by: Netflix
# cfa6009e	12-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	In preparation of merging projects/sendfile, transform bare access to sb_cc member of struct sockbuf to a couple of inline functions: sbavail() and sbused() Right now they are equal, but once notion of "not ready socket buffer data", will be checked in, they are going to be different. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 3e88eb90	08-Nov-2014	Andrey V. Elsukov <ae@FreeBSD.org>	Remove ip6_getdstifaddr() and all functions to work with auxiliary data. It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to determine ifaddr corresponding to destination address. Since currently we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called with zero zoneid and marked with XXX. Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr() instead. Sponsored by: Yandex LLC
# 6df8a710	07-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed. Sponsored by: Nginx, Inc.
# 9fd573c3	22-Sep-2014	Hans Petter Selasky <hselasky@FreeBSD.org>	Improve transmit sending offload, TSO, algorithm in general. The current TSO limitation feature only takes the total number of bytes in an mbuf chain into account and does not limit by the number of mbufs in a chain. Some kinds of hardware are limited by two factors. One is the fragment length and the second is the fragment count. Both of these limits need to be taken into account when doing TSO. Else some kinds of hardware might have to drop completely valid mbuf chains because they cannot be loaded into the given hardware's DMA engine. The new way of doing TSO limitation has been made backwards compatible as input from other FreeBSD developers and will use defaults for values not set. Reviewed by: adrian, rmacklem Sponsored by: Mellanox Technologies MFC after: 1 week
# 3220a212	16-Sep-2014	Gleb Smirnoff <glebius@FreeBSD.org>	FreeBSD-SA-14:19.tcp raised attention to the state of our stack towards blind SYN/RST spoofed attack. Originally our stack used in-window checks for incoming SYN/RST as proposed by RFC793. Later, circa 2003 the RST attack was mitigated using the technique described in P. Watson "Slipping in the window" paper [1]. After that, the checks were only relaxed for the sake of compatibility with some buggy TCP stacks. First, r192912 introduced the vulnerability, just fixed by aforementioned SA. Second, r167310 had slightly relaxed the default RST checks, instead of utilizing net.inet.tcp.insecure_rst sysctl. In 2010 a new technique for mitigation of these attacks was proposed in RFC5961 [2]. The idea is to send a "challenge ACK" packet to the peer, to verify that packet arrived isn't spoofed. If peer receives challenge ACK it should regenerate its RST or SYN with correct sequence number. This should not only protect against attacks, but also improve communication with broken stacks, so authors of reverted r167310 and r192912 won't be disappointed. [1] http://bandwidthco.com/whitepapers/netforensics/tcpip/TCP Reset Attacks.pdf [2] http://www.rfc-editor.org/rfc/rfc5961.txt Changes made: o Revert r167310. o Implement "challenge ACK" protection as specified in RFC5961 against RST attack. On by default. - Carefully preserve r138098, which handles empty window edge case, not described by the RFC. - Update net.inet.tcp.insecure_rst description. o Implement "challenge ACK" protection as specified in RFC5961 against SYN attack. On by default. - Provide net.inet.tcp.insecure_syn sysctl, to turn off RFC5961 protection. The changes were tested at Netflix. The tested box didn't show any anomalies compared to control box, except slightly increased number of TCP connection in LAST_ACK state. Reviewed by: rrs Sponsored by: Netflix Sponsored by: Nginx, Inc.
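The RFC 5961 handling of an incoming RST that 3220a212 refers to can be sketched as a three-way decision. The names are hypothetical, and the sketch covers only the rule from the RFC: it does not model the empty-window edge case from r138098 or the net.inet.tcp.insecure_rst sysctl mentioned in the commit message.

```c
#include <stdint.h>

enum rst_action { RST_DROP, RST_CHALLENGE_ACK, RST_RESET };

/*
 * Hypothetical classifier for a segment with the RST bit set:
 *  - sequence number exactly at rcv_nxt: reset the connection;
 *  - elsewhere inside the receive window: send a challenge ACK;
 *  - outside the window: silently drop.
 */
enum rst_action
classify_rst(uint32_t seg_seq, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	/* Offset into the window; unsigned math handles sequence wraparound. */
	uint32_t offset = seg_seq - rcv_nxt;

	if (offset == 0)
		return (RST_RESET);
	if (offset < rcv_wnd)
		return (RST_CHALLENGE_ACK);
	return (RST_DROP);
}
```

A blind attacker who can only guess a sequence number somewhere in the window now triggers a challenge ACK rather than a reset; a genuine peer answers the challenge with an RST carrying the exact sequence number.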
# 831ad37e	16-Sep-2014	Xin LI <delphij@FreeBSD.org>	Fix Denial of Service in TCP packet processing. Submitted by: glebius Security: FreeBSD-SA-14:19.tcp
# a7c7f2a7	04-Sep-2014	John Baldwin <jhb@FreeBSD.org>	In tcp_input(), don't acquire the pcbinfo global write lock for SYN packets targeting a listening socket. Permit to reduce TCP input processing starvation in context of high SYN load (e.g. short-lived TCP connections or SYN flood). Submitted by: Julien Charbon <jcharbon@verisign.com> Reviewed by: adrian, hiren, jhb, Mike Bentkofsky
# f7469d3e	09-Aug-2014	Hiren Panchasara <hiren@FreeBSD.org>	Improve comments by listing the criteria for automatic increment of receive socket buffer. Reviewed by: jmg
# 8f5a8818	07-Aug-2014	Kevin Lo <kevlo@FreeBSD.org>	Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have only one protocol switch structure that is shared between ipv4 and ipv6. Phabric: D476 Reviewed by: jhb
# cc412412	02-Jul-2014	Hiren Panchasara <hiren@FreeBSD.org>
# 3150f357	24-May-2014	Bjoern A. Zeeb <bz@FreeBSD.org>	Remove the prototype for the static inline function tcp_signature_verify_input(). The function is defined before first use already. MFC after: 2 weeks
# 5688fa66	23-May-2014	Bjoern A. Zeeb <bz@FreeBSD.org>	Remove the prototypes for things that are no longer file local but were moved to the header file. Pointy hat to: clang \|\| bz MFC after: 2 weeks X-MFC with: r266596 Reported by: gcc build of sparc64
# 255cd9fd	23-May-2014	Bjoern A. Zeeb <bz@FreeBSD.org>	Move the tcp_fields_to_host() and tcp_fields_to_net() (inline) functions to the tcp_var.h header file in order to avoid further duplication with upcoming commits. Reviewed by: np MFC after: 2 weeks
# 2f719932	18-May-2014	Adrian Chadd <adrian@FreeBSD.org>	Ensure that the flowid hashtype is assigned to the inp if the flowid is also assigned.
# e407b67b	04-May-2014	Gleb Smirnoff <glebius@FreeBSD.org>	The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with mixing on stack memory and UMA memory in one linked list. Thus, rewrite TCP reassembly code in terms of memory usage. The algorithm remains unchanged. We actually do not need extra memory to build a reassembly queue. Arriving mbufs are always packet header mbufs. So we got the length of data as pkthdr.len. We got m_nextpkt for linkage. And we need only one pointer to point at the tcphdr, use PH_loc for that. In tcpcb the t_segq field becomes an mbuf pointer. The t_segqlen field now counts not packets, but bytes in the queue. This gives us more precision when comparing to socket buffer limits. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 85536381	02-Apr-2014	Hiren Panchasara <hiren@FreeBSD.org>	Improve readability of comments for DELAY_ACK() macro.
# 153edc50	25-Mar-2014	Hiren Panchasara <hiren@FreeBSD.org>	Correct the comments as support for RFC 1644 has been removed for a long time.
# 62b90589	28-Jan-2014	Peter Wemm <peter@FreeBSD.org>	Adjust r239672 from rrs and r258821 from eadler. By definition, the very first FIN is not a duplicate. Process it normally and don't feed it to congestion control as though it were a dupe. Don't prevent CC from seeing later dupe acks while in a half close state.
# b8b4cfcd	25-Dec-2013	Sergey Kandaurov <pluknet@FreeBSD.org>	Draft-ietf-tcpm-initcwnd-05 became RFC6928. MFC after: 1 week
# 5f30ec9b	01-Dec-2013	Eitan Adler <eadler@FreeBSD.org>	In a situation where: - The remote host sends a FIN - in an ACK for a sequence number for which an ACK has already been received - There is still unacked data on route to the remote host - The packet does not contain a window update The packet may be dropped without processing the FIN flag. PR: kern/99188 Submitted by: Staffan Ulfberg <staffan@ulfberg.se> Discussed with: andre MFC after: never
# d9fae5ab	26-Nov-2013	Andriy Gapon <avg@FreeBSD.org>	dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores). Reviewed by: markj MFC after: 3 weeks
# fa22ce15	25-Nov-2013	Adrian Chadd <adrian@FreeBSD.org>	Convert over the TCP probes to use mtod() rather than directly dereferencing m->m_data. Sponsored by: Netflix, Inc.
# 54366c0b	25-Nov-2013	Attilio Rao <attilio@FreeBSD.org>	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invocation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip
# 76039bc8	26-Oct-2013	Gleb Smirnoff <glebius@FreeBSD.org>	The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare for this event, adding if_var.h to files that do need it. Also, include all includes that now are included due to implicit pollution via if_var.h Sponsored by: Netflix Sponsored by: Nginx, Inc.
# c1e5a6e5	22-Oct-2013	Andre Oppermann <andre@FreeBSD.org>	The TCP delayed ACK logic isn't aware of LRO passing up large aggregated segments thinking it received only one segment. This causes it to delay the ACK for 100ms to wait for another segment which may never come because all the data was received already. Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes us further away from acking every other packet; b) it introduces additional delay in responding to the sender. The latter is especially bad because it is in the nature of LRO to aggregate all segments of a burst with no more coming until an ACK is sent back. Change the delayed ACK logic to detect LRO segments by being larger than the MSS for this connection and issuing an immediate ACK for them to keep the ACK clock ticking without interruption. Reported by: julian, cperciva Tested by: cperciva Reviewed by: lstewart MFC after: 3 days
# c11a15bf	08-Oct-2013	Gleb Smirnoff <glebius@FreeBSD.org>	When processing ACK in tcp_do_segment, use sbcut_locked() instead of sbdrop_locked() to cut acked mbufs from the socket buffer. Free this chain in a batch manner after the socket buffer lock is dropped. This measurably reduces contention on socket buffer. Sponsored by: Netflix Sponsored by: Nginx, Inc. Approved by: re (marius)
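The delayed ACK change described in the entry above comes down to one extra length check. The following C sketch is an assumption-laden simplification, not the kernel's DELAY_ACK() macro: `should_delay_ack` and its parameters are hypothetical names, and the real logic consults additional connection state.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether an ACK for a received segment may be
 * delayed. A segment longer than the connection's MSS can only be an LRO
 * aggregate, so it is acknowledged immediately.
 */
static bool
should_delay_ack(uint32_t seg_len, uint32_t maxseg, bool delack_timer_running)
{
    if (seg_len > maxseg)
        return false;              /* LRO aggregate: ACK right away */
    return !delack_timer_running;  /* otherwise delay at most one ACK */
}
```

With an MSS of 1460, a 1460-byte segment can still be delayed, while a 14600-byte LRO aggregate forces an immediate ACK and keeps the sender's ACK clock running.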
# 57f60867	25-Aug-2013	Mark Johnston <markj@FreeBSD.org>	Implement the ip, tcp, and udp DTrace providers. The probe definitions use dynamic translation so that their arguments match the definitions for these providers in Solaris and illumos. Thus, existing scripts for these providers should work unmodified on FreeBSD. Tested by: gnn, hiren MFC after: 1 month
# 6794f460	23-Jul-2013	Andrey V. Elsukov <ae@FreeBSD.org>	Remove the large part of struct ipsecstat. Only a few fields of this structure are used, but they already have equal fields in the struct newipsecstat, that was introduced with FAST_IPSEC and then was merged together with old ipsecstat structure. This fixes kernel stack overflow on some architectures after migration ipsecstat to PCPU counters. Reported by: Taku YAMAMOTO, Maciej Milewski
# 07dacf03	09-Jul-2013	Andre Oppermann <andre@FreeBSD.org>	Extend debug logging of TCP timestamp related specification violations. Update related comments and style.
# 5da0521f	09-Jul-2013	Andrey V. Elsukov <ae@FreeBSD.org>	Use new macros to implement ipstat and tcpstat using PCPU counters. Change interface of kread_counters() similar to kread() in the netstat(1).
# 42a253e6	21-Jun-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Fix kmod_*stat_inc() after r249276. The incorrect code actually increased the pointer, not the memory it points to. In collaboration with: kib Reported & tested by: Ian FREISLICH <ianf clue.co.za> Sponsored by: Nginx, Inc.
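The class of bug fixed in the entry above (incrementing a pointer instead of the value it points to) is easy to reproduce in isolation. The functions below are hypothetical stand-ins written for illustration; they are not the kmod_*stat_inc() code from the tree.

```c
#include <stdint.h>

/* Buggy variant: advances the local pointer, the counter itself is untouched. */
static void
stat_inc_buggy(uint64_t *counter)
{
    counter++;
    (void)counter;   /* silence "set but unused" warnings */
}

/* Fixed variant: dereference first, then increment the stored value. */
static void
stat_inc_fixed(uint64_t *counter)
{
    (*counter)++;
}
```

Because the buggy variant only modifies its own copy of the pointer, it compiles cleanly and silently leaves the statistic at its old value.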
# 6659296c	20-Jun-2013	Andrey V. Elsukov <ae@FreeBSD.org>	Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics accounting. MFC after: 2 weeks
# 3c914c54	02-Jun-2013	Andre Oppermann <andre@FreeBSD.org>	Allow drivers to specify a maximum TSO length in bytes if they are limited in the amount of data they can handle at once. Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to change the limit. The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything less wouldn't be very useful anymore. The upper limit is still at IP_MAXPACKET (65536 bytes). Raising it requires further auditing of the IPv4/v6 code path's as the length field in the IP header would overflow leading to confusion in firewalls and others packet handler on the real size of the packet. The placement into "struct ifnet" is a bit hackish but the best place that was found. When the stack/driver boundary is updated it should be handled in a better way. Submitted by: cperciva (earlier version) Reviewed by: cperciva Tested by: cperciva MFC after: 1 week (using spare struct members to preserve ABI)
# 5628dd08	23-Apr-2013	Andre Oppermann <andre@FreeBSD.org>	When doing RFC3042 limited transmit on the first or second duplicate ACK make sure we actually have new data to send. This prevents us from sending unnecessary pure ACKs. Reported by: Matt Miller <matt@matthewjmiller.net> Tested by: Matt Miller <matt@matthewjmiller.net> MFC after: 2 weeks
# 982c1675	09-Apr-2013	Andre Oppermann <andre@FreeBSD.org>	Fix a race condition on tcp listen socket teardown with pending connections in the accept queue and contiguous new incoming SYNs. Compared to the original submitter's patch I've moved the test next to the SYN handling to have it together in a logical unit and reworded the comment explaining the issue. Submitted by: Matt Miller <matt@matthewjmiller.net> Submitted by: Juan Mojica <jmojica@gmail.com> Reviewed by: Matt Miller (changes) Tested by: pho MFC after: 1 week
# 4a21e86e	09-Apr-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Fix VIMAGE build.
# 5923c293	08-Apr-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Merge from projects/counters: TCP/IP stats. Convert 'struct ipstat' and 'struct tcpstat' to counter(9). This speeds up IP forwarding at extreme packet rates, and makes accounting more precise. Sponsored by: Nginx, Inc.
# ce7ad664	29-Mar-2013	Ed Maste <emaste@FreeBSD.org>	Keep fwd_tag around for subsequent pcb lookups For TIMEWAIT handling tcp_input may have to jump back for an additional pass through pcblookup. Prior to this change the fwd_tag had been discarded after the first lookup, so a new connection attempt delivered locally via 'ipfw fwd' would fail to find a match. As of r248886 the tag will be detached and freed when passed to the socket buffer.
# 5b648e79	22-Jan-2013	Lawrence Stewart <lstewart@FreeBSD.org>	Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited" logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could therefore potentially corrupt the result (although under normal operation, neither variable should legitimately exceed 32 bits). [1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html Submitted by: jhb MFC after: 1 week
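The truncation hazard named in the entry above can be shown with a small C sketch. It assumes a min() helper that takes 32-bit unsigned arguments; `min_u32`, `min_u64` and `window_min_truncating` are hypothetical names, and the explicit casts spell out what an implicit conversion would do silently.

```c
#include <stdint.h>

/* Stand-in for a min() helper declared with 32-bit unsigned parameters. */
static uint32_t
min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Width-preserving alternative. */
static uint64_t
min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/* What happens when 64-bit window values pass through the 32-bit helper. */
static uint64_t
window_min_truncating(uint64_t cwnd, uint64_t wnd)
{
    return min_u32((uint32_t)cwnd, (uint32_t)wnd);
}
```

For a congestion window of 2^32 + 5 and a send window of 100, the truncating path yields 5 rather than 100, which would flip the result of a "congestion window limited" test.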
# b8056fae	18-Dec-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Fix !INET6 build after r244365.
# dd029d52	18-Dec-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Clear correct flag in INET6 case.
# f4912745	17-Dec-2012	Andrey V. Elsukov <ae@FreeBSD.org>	Since we use different flags to detect tcp forwarding, and we share the same code for IPv4 and IPv6 in tcp_input, we should check both M_IP_NEXTHOP and M_IP6_NEXTHOP flags. MFC after: 3 days
# 78a7880f	12-Dec-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Fix a crash in tcp_input(), that happens when mbuf has a fwd_tag on it, but later after processing and freeing the tag, we need to jump back again to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to process and free the tag for second time. Reported & tested by: Pawel Tyll <ptyll nitronet.pl> MFC after: 3 days
# 60ee3bb2	05-Nov-2012	Andre Oppermann <andre@FreeBSD.org>	Back out r242262. The simplified window change/update logic wasn't complete and ready for production use. PR: kern/173309
# ffdbf9da	01-Nov-2012	Andrey V. Elsukov <ae@FreeBSD.org>	Remove the recently added sysctl variable net.pfil.forward. Instead, add protocol specific mbuf flags M_IP_NEXTHOP and M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup only when this flag is set. Suggested by: andre
# 09440655	28-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Increase the initial CWND to 10 segments as defined in IETF TCPM draft-ietf-tcpm-initcwnd-05. It explains why the increased initial window improves the overall performance of many web services without risking congestion collapse. As long as it remains a draft it is placed under a sysctl marking it as experimental: net.inet.tcp.experimental.initcwnd10 = 1 When it becomes an official RFC soon the sysctl will be changed to the RFC number and moved to net.inet.tcp. This implementation differs from the RFC draft in that it is a bit more conservative in the case of packet loss on SYN or SYN\|ACK because we haven't reduced the default RTO to 1 second yet. Also the restart window isn't yet increased as allowed. Both will be adjusted with upcoming changes. It is enabled by default. In Linux it is enabled since kernel 3.0. MFC after: 2 weeks
# 79ce26a0	28-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Simplify and enhance the window change/update acceptance logic, especially in the presence of bi-directional data transfers. snd_wl1 tracks the right edge, including data in the reassembly queue, of valid incoming data. This makes it like rcv_nxt plus reassembly. It never goes backwards to prevent older, possibly reordered segments from updating the window. snd_wl2 tracks the left edge of sent data. This makes it a duplicate of snd_una. However joining them right now is difficult due to separate update dependencies in different places in the code flow. snd_wnd tracks the send window currently advertised by the peer. In tcp_output() the effective window is calculated by subtracting the already in-flight data, snd_nxt less snd_una, from it. ACK's become the main clock of window updates and will always update the window when the left edge of what we sent is advanced. The ACK clock is the primary signaling mechanism in ongoing data transfers. This works reliably even in the presence of reordering, reassembly and retransmitted segments. The ACK clock is most important because it determines how much data we are allowed to inject into the network. Zero window updates, which get us out of persist mode, are crucial. Here a segment that neither moves ACK nor SEQ but enlarges WND is accepted. When the ACK clock is not active (that is we're not or no longer sending any data) any segment that moves the extended right SEQ edge, including out-of-order segments, updates the window. This gives us updates especially during ping-pong transfers where the peer isn't done consuming the already acknowledged data from the receive buffer while responding with data. The SSH protocol is a prime candidate to benefit from the improved bi-directional window update logic as it has its own windowing mechanism on top of TCP and is frequently sending back protocol ACK's. Tcpdump provided by: darrenr Tested by: darrenr MFC after: 2 weeks
# 4faaea55	28-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Allow arbitrary MSS sizes and don't mind about the cluster size anymore. We've got more cluster sizes for quite some time now and the originally imposed limits and the previously codified thoughts on efficiency gains are no longer true. MFC after: 2 weeks
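For reference, the initial window that the draft (later published as RFC 6928) specifies is min(10*MSS, max(2*MSS, 14600)) bytes. The C sketch below evaluates that formula only; the function name is hypothetical, and the code does not model the more conservative behaviour after SYN loss that the entry above mentions.

```c
#include <stdint.h>

static uint32_t
min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

static uint32_t
max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/* Initial congestion window in bytes: min(10*MSS, max(2*MSS, 14600)). */
static uint32_t
initial_cwnd_rfc6928(uint32_t mss)
{
    return min_u32(10 * mss, max_u32(2 * mss, 14600));
}
```

With a typical Ethernet MSS of 1460 bytes this gives 14600 bytes, i.e. ten full segments; with a 9000-byte MSS the upper bound of 2*MSS applies and the window is 18000 bytes.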
# cf8f04f4	28-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to reduce the initial CWND to one segment. This reduction got lost some time ago due to a change in initialization ordering. Additionally in tcp_timer_rexmt() avoid entering fast recovery when we're still in TCPS_SYN_SENT state. MFC after: 2 weeks
# 22efabd4	28-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Adjust the initial default CWND upon connection establishment to the new and increased values specified by RFC5681 Section 3.1. The even larger initial CWND per RFC3390, if enabled, is not affected. MFC after: 2 weeks
# c1de64a4	25-Oct-2012	Andrey V. Elsukov <ae@FreeBSD.org>	Remove the IPFIREWALL_FORWARD kernel option and make possible to turn on the related functionality in the runtime via the sysctl variable net.pfil.forward. It is turned off by default. Sponsored by: Yandex LLC Discussed with: net@ MFC after: 2 weeks
# 8ad458a4	23-Oct-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Do not reduce ip_len by size of IP header in the ip_input() before passing a packet to protocol input routines. For several protocols this means that now protocol needs to do subtraction itself, and for another half this means that we do not need to add header length back to the packet. Make ip_stripoptions() to adjust ip_len, since now we enter this function with a packet header whose ip_len does represent length of entire packet, not payload only.
# 8f134647	22-Oct-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Switch the entire IPv4 stack to keep the IP packet header in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet. After this change a packet processed by the stack isn't modified at all[2] except for TTL. After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack. [1] One exception still remains. The raw sockets convert host byte order before passing a packet to an application. Probably this would remain for ages for compatibility. [2] The ip_input() still subtracts header len from ip->ip_len, but this is planned to be fixed soon. Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
# 105bd211	12-Oct-2012	Gleb Smirnoff <glebius@FreeBSD.org>	In ip_stripoptions(): - Remove unused argument and incorrect comment. - Fixup ip_len after stripping.
# ec03d543	25-Aug-2012	Randall Stewart <rrs@FreeBSD.org>	This small change takes care of a race condition that can occur when both sides close at the same time. If that occurs, without this fix the connection enters FIN1 on both sides and they will forever send FIN\|ACK at each other until the connection times out. This is because we stopped processing the FIN\|ACK and thus did not advance the sequence and so never ACK'd each others FIN. This fix adjusts it so we do process the FIN properly and the race goes away ;-) MFC after: 1 month
# 0989f56c	22-Jul-2012	Robert Watson <rwatson@FreeBSD.org>	Update some stale comments regarding tcbinfo locking in the TCP input path: read locks on tcbinfo are no longer used, so won't happen. No functional change. MFC after: 3 days
# 09fe6320	19-Jun-2012	Navdeep Parhar <np@FreeBSD.org>	- Updated TOE support in the kernel. - Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs. These are available as t3_tom and t4_tom modules that augment cxgb(4) and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as usual with or without these extra features. - iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the works and will follow soon. Build-tested with make universe. 30s overview ============ What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the capabilities of an interface: # ifconfig -m \| grep TOE Enable/disable TCP offload on an interface (just like any other ifnet capability): # ifconfig cxgbe0 toe # ifconfig cxgbe0 -toe Which connections are offloaded? Look for toe4 and/or toe6 in the output of netstat and sockstat: # netstat -np tcp \| grep toe # sockstat -46c \| grep toe Reviewed by: bz, gnn Sponsored by: Chelsio communications. MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
# 77d396fd	04-Jun-2012	Maksim Yevmenkin <emax@FreeBSD.org>	Plug more refcount leaks and possible NULL deref for interface address list. Submitted by: scottl@ MFC after: 3 days
# 356ab07e	28-May-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	It turns out that too many drivers are not only parsing the L2/3/4 headers for TSO but also for generic checksum offloading. Ideally we would only have one common function shared amongst all drivers, and perhaps when updating them for IPv6 we should introduce that. Eventually we should provide the meta information along with mbufs to avoid (re-)parsing entirely. To not break IPv6 (checksums and offload) and to be able to MFC the changes without risking to hurt 3rd party drivers, duplicate the v4 framework, as other OSes have done as well. Introduce interface capability flags for TX/RX checksum offload with IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6 flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6 fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and add an alias for CSUM_DATA_VALID_IPV6. This pretty much brings IPv6 handling in line with IPv4. TSO is still handled in a different way and not via if_hwassist. Update ifconfig to allow (un)setting of the new capability flags. Update loopback to announce the new capabilities and if_hwassist flags. Individual driver updates will have to follow, as will SCTP. Reported by: gallatin, dim, .. Reviewed by: gallatin (glanced at?) MFC after: 3 days X-MFC with: r235961,235959,235958
# 45747ba5	24-May-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	MFp4 bz_ipv6_fast: Add code to handle pre-checked TCP checksums as indicated by mbuf flags to save the entire computation for validation if not needed. In the IPv6 TCP output path only compute the pseudo-header checksum, set the checksum offset in the mbuf field along the appropriate flag as done in IPv4. In tcp_respond() just initialize the IPv6 payload length to 0 as ip6_output() will properly set it. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems Reviewed by: gnn (as part of the whole) MFC After: 3 days
# 3a9391def	24-May-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	MFp4 bz_ipv6_fast: Factor out the tcp_hc_getmtu() call. As the comments say it applies to both v4 and v6, so only write it once making it easier to read the protocol family specific code. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems Reviewed by: gnn (as part of the whole) MFC After: 3 days
# ef341ee1	16-Apr-2012	Gleb Smirnoff <glebius@FreeBSD.org>	When we receive an ICMP unreach need fragmentation datagram, we take proposed MTU value from it and update the TCP host cache. Then tcp_mss_update() is called on the corresponding tcpcb. It finds the just allocated entry in the TCP host cache and updates MSS on the tcpcb. And then we do a fast retransmit of what we have in the tcp send buffer. This sequence gets broken if the TCP host cache is exhausted. In this case allocation fails, and later called tcp_mss_update() finds nothing in cache. The fast retransmit is done with an unreduced MSS and is immediately replied by remote host with new ICMP datagrams and the cycle repeats. This ping-pong can go up to wirespeed. To fix this: - tcp_mss_update() gets new parameter - mtuoffer, that is like offer, but needs to have min_protoh subtracted. - tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify(). - tcp_mtudisc() now accepts not a useless error argument, but proposed MTU value, that is passed to tcp_mss_update() as mtuoffer. Reported by: az Reported by: Andrey Zonov <andrey zonov.org> Reviewed by: andre (previous version of patch)
# d8951c8a	15-Feb-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when hz >> 1000 and thus getting outside the timestamp clock frequency of 1ms < x < 1s per tick as mandated by RFC1323, leading to connection resets on idle connections. Always use a granularity of 1ms using getmicrouptime() making all but relevant callouts independent of hz. Use getmicrouptime(), not getmicrotime() as the latter may make a jump possibly breaking TCP nfsroot mounts having our timestamps move forward for more than 24.8 days in a second without having been idle for that long. PR: kern/61404 Reviewed by: jhb, mav, rrs Discussed with: silby, lstewart Sponsored by: Sandvine Incorporated (originally in 2011) MFC after: 6 weeks
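The fixed 1 ms timestamp granularity described in the entry above can be sketched as follows. This is an illustration under assumptions: `ts_ticks_ms` and `ts_lt` are hypothetical helpers, and the uptime is passed in as plain seconds and microseconds instead of being read with getmicrouptime().

```c
#include <stdint.h>

/* Convert an uptime of (sec, usec) into a 32-bit timestamp with 1 ms ticks. */
static uint32_t
ts_ticks_ms(uint64_t sec, uint32_t usec)
{
    return (uint32_t)(sec * 1000 + usec / 1000);
}

/* Modular "a is older than b" comparison, as used for PAWS-style checks. */
static int
ts_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

With 1 ms ticks, the signed 32-bit comparison is only meaningful for differences below 2^31 ms, which is roughly 24.8 days. That is why a wall-clock jump of that size would make fresh timestamps look stale.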
# 9077f387	05-Feb-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT, that allow to control initial timeout, idle time, idle re-send interval and idle send count on a per-socket basis. Reviewed by: andre, bz, lstewart
# 1e96ae81	05-Jan-2012	John Baldwin <jhb@FreeBSD.org>	Remove the assertion from tcp_input() that rcv_nxt is always greater than or equal to rcv_adv and fix tcp_twstart() to handle this case by assuming the last window was zero rather than a negative value. The code in tcp_input() already safely handled this case. It can happen due to delayed ACKs along with a remote sender that sends data beyond the window we previously advertised. If we have room in our socket buffer for the extra data beyond the advertised window, we will accept it. However, if the ACK for that segment is delayed, then we will not effectively fixup rcv_adv to account for that extra data until the next segment arrives and forces out an ACK. When that next segment arrives, rcv_nxt will be beyond rcv_adv. Tested by: pjd MFC after: 1 week
# 6472ac3d	07-Nov-2011	Ed Schouten <ed@FreeBSD.org>	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
# ddd0c4a9	02-Nov-2011	Sergey Kandaurov <pluknet@FreeBSD.org>	Restore sysctl names for tcp_sendspace/tcp_recvspace. They seem to have been changed unintentionally in r226437, and there was no mention of renaming in the commit log message. Reported by: Anton Yuzhaninov <citrin citrin ru>
# fba0cea1	16-Oct-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Add syntactic sugar missed in r226437 and then not added either when moving things around in r226448 but desperately needed to always make things compile successfully. MFC after: 1 week
# 873789cb	16-Oct-2011	Andre Oppermann <andre@FreeBSD.org>	Move the tcp_sendspace and tcp_recvspace sysctl's from the middle of tcp_usrreq.c to the top of tcp_output.c and tcp_input.c respectively next to the socket buffer autosizing controls. MFC after: 1 week
# 9ec4a4cc	16-Oct-2011	Andre Oppermann <andre@FreeBSD.org>	Remove the ss_fltsz and ss_fltsz_local sysctl's which have long been superseded by the RFC3390 initial CWND sizing. Also remove the remnants of TCP_METRICS_CWND which used the TCP hostcache to set the initial CWND in a non-RFC compliant way. MFC after: 1 week
# e233e2ac	16-Oct-2011	Andre Oppermann <andre@FreeBSD.org>	VNET virtualize tcp_sendspace/tcp_recvspace and change the type to INT. A long is not necessary as the TCP window is limited to 2**30. A larger initial window isn't useful. MFC after: 1 week
# 4af309c8	06-Oct-2011	Attilio Rao <attilio@FreeBSD.org>	For the INP_TIMEWAIT case, there is no valid tcpcb object tied to the inpcb object. Skip the TCP_SIGNATURE check in that case as it is consistent with the output path (no TCP_SIGNATURE for outgoing packets in TIMEWAIT state) and also because for TIMEWAIT state the verify may be less effective. Sponsored by: Sandvine Incorporated Reported by: rwatson No objections by: rwatson MFC after: 3 days
# b233773b	25-Aug-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Increase the defaults for the maximum socket buffer limit, and the maximum TCP send and receive buffer limits from 256kB to 2MB. For sb_max_adj we need to add the cast as already used in the sysctl handler to not overflow the type doing the maths. Note that this is just the defaults. They will allow more memory to be consumed per socket/connection if needed but not change the default "idle" memory consumption. All values are still tunable by sysctls. Suggested by: gnn Discussed on: arch (Mar and Aug 2011) MFC after: 3 weeks Approved by: re (kib)
# 6f697424	20-Aug-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix compilation in case of defined(INET) && defined(IPFIREWALL_FORWARD) but no INET6. Reported by: avg Tested by: avg MFC after: 4 weeks X-MFC with: r225044 Approved by: re (kib)
# 8a006adb	20-Aug-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Add support for IPv6 to ipfw fwd: Distinguish IPv4 and IPv6 addresses and optional port numbers in user space to set the option for the correct protocol family. Add support in the kernel for carrying the new IPv6 destination address and port. Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change the address in the IP header. Add support for IPv6 forwarding to a non-local destination. Add a regression test utilizing VIMAGE to check all 20 possible combinations I could think of. Obtained from: David Dolson at Sandvine Incorporated (original version for ipfw fwd IPv6 support) Sponsored by: Sandvine Incorporated PR: bin/117214 MFC after: 4 weeks Approved by: re (kib)
# d3c1f003	04-Jun-2011	Robert Watson <rwatson@FreeBSD.org>	Add _mbuf() variants of various inpcb-related interfaces, including lookup, hash install, etc. For now, these arguments are unused, but as we add RSS support, we will want to use hashes extracted from mbufs, rather than manually calculated hashes of header fields, due to the expense of the software version of Toeplitz (and similar hashes). Add notes that it would be nice to be able to pass mbufs into lookup routines in pf(4), optimising firewall lookup in the same way, but the code structure there doesn't facilitate that currently. (In principle there is no reason this couldn't be MFCed -- the change extends rather than modifies the KBI. However, it won't be useful without other previous possibly less MFCable changes.) Reviewed by: bz Sponsored by: Juniper Networks, Inc.
# fa046d87	30-May-2011	Robert Watson <rwatson@FreeBSD.org>	Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. 
pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. 
It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
# 5891ebd6	14-May-2011	John Baldwin <jhb@FreeBSD.org>	Oops, fix order of sequence numbers in KASSERT()'s to catch negative receive windows to match the labels in the panic message. Submitted by: trociny
# f701e30d	02-May-2011	John Baldwin <jhb@FreeBSD.org>	Handle a rare edge case with nearly full TCP receive buffers. If a TCP buffer fills up causing the remote sender to enter into persist mode, but there is still room available in the receive buffer when a window probe arrives (either due to window scaling, or due to the local application very slowly draining data from the receive buffer), then the single byte of data in the window probe is accepted. However, this can cause rcv_nxt to be greater than rcv_adv. This condition will only last until the next ACK packet is pushed out via tcp_output(), and since the previous ACK advertised a zero window, the ACK should be pushed out while the TCP pcb is write-locked. During the window while rcv_nxt is greater than rcv_adv, a few places would compute the remaining receive window via rcv_adv - rcv_nxt. However, this value was then (uint32_t)-1. On a 64 bit machine this could expand to a positive 2^32 - 1 when cast to a long. In particular, when calculating the receive window in tcp_output(), the result would be that the receive window was computed as 2^32 - 1 resulting in advertising a far larger window to the remote peer than actually existed. Fix various places that compute the remaining receive window to either assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the window as full if rcv_nxt is greater than rcv_adv. Reviewed by: bz MFC after: 1 month
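The underflow described in f701e30d can be reproduced outside the kernel. The following is a minimal sketch, not the committed code: struct rcv_state and the three helper functions are hypothetical names introduced for illustration, and only rcv_nxt and rcv_adv come from the commit message. It assumes a platform where converting an out-of-range uint32_t to int32_t wraps, as it does on the usual two's-complement targets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two tcpcb fields named in the commit. */
struct rcv_state {
	uint32_t rcv_nxt;	/* next sequence number expected */
	uint32_t rcv_adv;	/* highest sequence number advertised */
};

/*
 * Problematic form: the unsigned subtraction wraps to (uint32_t)-1 when
 * rcv_nxt is one past rcv_adv, and widening that to a 64-bit long yields
 * a large positive window.
 */
static long
remaining_window_unsafe(const struct rcv_state *rs)
{
	return ((long)(rs->rcv_adv - rs->rcv_nxt));
}

/* First remedy from the commit text: assert the window is not negative. */
static long
remaining_window_asserted(const struct rcv_state *rs)
{
	int32_t diff = (int32_t)(rs->rcv_adv - rs->rcv_nxt);

	assert(diff >= 0);
	return ((long)diff);
}

/* Second remedy: treat the window as full when rcv_nxt has passed rcv_adv. */
static long
remaining_window(const struct rcv_state *rs)
{
	int32_t diff = (int32_t)(rs->rcv_adv - rs->rcv_nxt);

	return (diff > 0 ? (long)diff : 0);
}
```

With rcv_nxt one byte past rcv_adv, the unsafe form reports 2^32 - 1 where long is 64 bits wide, while the clamped form reports zero.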
# 29bd2010	30-Apr-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix a mismerge from p4 in that in_localaddr() is not available without INET. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days
# b287c6c7	30-Apr-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Make the TCP code compile without INET. Sort #includes and add #ifdef INETs. Add some comments at #endifs given more nestedness. To make the compiler happy, some default initializations were added in accordance with the style on the files. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days
# 672dc4ae	29-Apr-2011	John Baldwin <jhb@FreeBSD.org>	TCP reuses t_rxtshift to determine the backoff timer used for both the persist state and the retransmit timer. However, the code that implements "bad retransmit recovery" only checks t_rxtshift to see if an ACK has been received during the first retransmit timeout window. As a result, if ticks has wrapped over to a negative value and a socket is in the persist state, it can incorrectly treat an ACK from the remote peer as a "bad retransmit recovery" and restore saved values such as snd_ssthresh and snd_cwnd. However, if the socket has never had a retransmit timeout, then these saved values will be zero, so snd_ssthresh and snd_cwnd will be set to 0. If the socket is in fast recovery (this can be caused by excessive duplicate ACKs such as those fixed by 220794), then each ACK that arrives triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd to be no larger than snd_ssthresh. In effect, the socket's send window is permanently stuck at 0 even though the remote peer is advertising a much larger window and pending data is only sent via TCP window probes (so one byte every few seconds). Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that the various snd_*_prev fields in the pcb are valid and only perform "bad retransmit recovery" if this flag is set in the pcb. The flag is set on the first retransmit timeout that occurs and is cleared on subsequent retransmit timeouts or when entering the persist state. Reviewed by: bz MFC after: 2 weeks
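The flag lifecycle described in 672dc4ae can be sketched as a small state machine. This is an illustration under stated assumptions rather than the kernel code: struct conn, the three functions, and the bit value chosen for TF_PREVVALID are all hypothetical, and the ack_in_window parameter stands in for the kernel's check that the ACK arrived within the first retransmit timeout window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	TF_PREVVALID	0x01	/* illustrative bit value, not the kernel's */

/* Hypothetical, trimmed-down stand-in for the relevant tcpcb fields. */
struct conn {
	unsigned int flags;
	int rxtshift;			/* shared by persist and retransmit */
	uint32_t snd_cwnd, snd_ssthresh;
	uint32_t snd_cwnd_prev, snd_ssthresh_prev;
};

/* Set the flag on the first retransmit timeout, clear it on later ones. */
static void
on_rexmt_timeout(struct conn *c)
{
	if (++c->rxtshift == 1) {
		c->snd_cwnd_prev = c->snd_cwnd;
		c->snd_ssthresh_prev = c->snd_ssthresh;
		c->flags |= TF_PREVVALID;
	} else
		c->flags &= ~TF_PREVVALID;
}

/* Entering the persist state invalidates any saved values. */
static void
on_enter_persist(struct conn *c)
{
	c->flags &= ~TF_PREVVALID;
}

/* Restore the saved values only when they are known to be valid. */
static bool
maybe_undo_bad_rexmt(struct conn *c, bool ack_in_window)
{
	if (c->rxtshift != 1 || !ack_in_window ||
	    (c->flags & TF_PREVVALID) == 0)
		return (false);
	c->snd_cwnd = c->snd_cwnd_prev;
	c->snd_ssthresh = c->snd_ssthresh_prev;
	c->flags &= ~TF_PREVVALID;
	return (true);
}
```

A connection in the persist state with rxtshift equal to 1 no longer has its congestion window overwritten with the zeroed snd_*_prev values, because the flag check fails before any restore happens.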
# 2903309a	25-Apr-2011	Attilio Rao <attilio@FreeBSD.org>	Add the possibility to verify MD5 hash of incoming TCP packets. Since this is a costly function, even when compiled in (along with the option TCP_SIGNATURE), it can be disabled via the net.inet.tcp.signature_verify_input sysctl. Sponsored by: Sandvine Incorporated Reviewed by: emaste, bz MFC after: 2 weeks
# 891b8ed4	12-Apr-2011	Lawrence Stewart <lstewart@FreeBSD.org>	Use the full and proper company name for Swinburne University of Technology throughout the source tree. Requested by: Grenville Armitage, Director of CAIA at Swinburne University of Technology MFC after: 3 days
# 766282cb	29-Mar-2011	John Baldwin <jhb@FreeBSD.org>	Clamp the initial advertised receive window when responding to a SYN/ACK to the maximum allowed window. Growing the window too large would cause an underflow in the calculations in tcp_output() to decide if a window update should be sent which would prevent the persist timer from being started if data was pending and the other end of the connection advertised an initial window size of 0. PR: kern/154006 Submitted by: Stefan `Sec` Zehl sec 42 org Reviewed by: bz MFC after: 1 week
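The clamp described in 766282cb amounts to bounding the initial window by the largest value the 16-bit window field can express at the negotiated scale. A rough sketch follows, assuming the standard limits of 65535 for an unscaled window and 14 for the maximum window shift; the function name clamp_initial_window is hypothetical and the committed change may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

#define	TCP_MAXWIN		65535	/* largest unscaled window */
#define	TCP_MAX_WINSHIFT	14	/* largest window scale shift */

/*
 * Bound the receive space advertised in response to a SYN/ACK by the
 * maximum window representable at the given receive scale.
 */
static uint32_t
clamp_initial_window(uint32_t rcv_space, unsigned int rcv_scale)
{
	uint32_t max_wnd;

	assert(rcv_scale <= TCP_MAX_WINSHIFT);
	max_wnd = (uint32_t)TCP_MAXWIN << rcv_scale;
	return (rcv_space < max_wnd ? rcv_space : max_wnd);
}
```

A 1 MB receive buffer with no window scaling is clamped to 65535 bytes, while the same buffer with a scale of 7 is advertised in full.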
# d64a46ea	09-Jan-2011	Lawrence Stewart <lstewart@FreeBSD.org>	Reset the last_sack_ack SACK hint for TCP input processing to ensure that the hint is 0 when no SACK data is received to update the hint with. This was accidentally omitted from r216753. Sponsored by: FreeBSD Foundation MFC after: 10 weeks X-MFC with: 216753
# 79e955ed	07-Jan-2011	John Baldwin <jhb@FreeBSD.org>	Trim extra spaces before tabs.
# 39bc9de5	27-Dec-2010	Lawrence Stewart <lstewart@FreeBSD.org>	- Add some helper hook points to the TCP stack. The hooks allow Khelp modules to access inbound/outbound events and associated data for established TCP connections. The hooks only run if at least one hook function is registered for the hook point, ensuring the impact on the stack is effectively nil when no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual data to any registered Khelp module hook functions. - Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp modules to associate per-connection data with the TCP control block. - Bump __FreeBSD_version and add a note to UPDATING regarding the ABI changes introduced by this commit and r216753. In collaboration with: David Hayes <dahayes at swin edu au> and Grenville Armitage <garmitage at swin edu au> Sponsored by: FreeBSD Foundation Reviewed by: bz, others along the way MFC after: 3 months
# 6157935f	01-Dec-2010	Lawrence Stewart <lstewart@FreeBSD.org>	Set ssthresh appropriately on RTO. This change was accidentally not ported from the pre modular CC stack. Sponsored by: FreeBSD Foundation Submitted by: David Hayes <dahayes at swin edu au> MFC after: 9 weeks X-MFC with: r215166
# dbc42409	11-Nov-2010	Lawrence Stewart <lstewart@FreeBSD.org>	This commit marks the first formal contribution of the "Five New TCP Congestion Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details about the project are available at: http://caia.swin.edu.au/freebsd/5cc/ - Add a KPI and supporting infrastructure to allow modular congestion control algorithms to be used in the net stack. Algorithms can maintain per-connection state if required, and connections maintain their own algorithm pointer, which allows different connections to concurrently use different algorithms. The TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to programmatically query or change the congestion control algorithm respectively from within an application at runtime. - Integrate the framework with the TCP stack in as least intrusive a manner as possible. Care was also taken to develop the framework in a way that should allow integration with other congestion aware transport protocols (e.g. SCTP) in the future. The hope is that we will one day be able to share a single set of congestion control algorithm modules between all congestion aware transport protocols. - Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack and use it to decouple the meaning of recovery from a congestion event and recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based congestion control protocols don't generally need to recover from packet loss and need a different way to note a congestion recovery episode within the stack. - Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code and ensures the stack always uses the appropriate mechanisms for recovering from packet loss during a congestion recovery episode. - Extract the NewReno congestion control algorithm from the TCP stack and massage it into module form. 
NewReno is always built into the kernel and will remain the default algorithm for the foreseeable future. Implementations of additional different algorithms will become available in the near future. - Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code that relies on the size of "struct tcpcb" is required. Many thanks go to the Cisco University Research Program Fund at Community Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work at the Centre for Advanced Internet Architectures, Swinburne University of Technology is greatly appreciated. In collaboration with: David Hayes <dahayes at swin edu au> and Grenville Armitage <garmitage at swin edu au> Sponsored by: Cisco URP, FreeBSD Foundation Reviewed by: rpaulo Tested by: David Hayes (and many others over the years) MFC after: 3 months
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# 1c18314d	16-Sep-2010	Andre Oppermann <andre@FreeBSD.org>	Remove the TCP inflight bandwidth limiter as announced in r211315 to give way for the pluggable congestion control framework. It is the task of the congestion control algorithm to set the congestion window and amount of inflight data without external interference. In 'struct tcpcb' the variables previously used by the inflight limiter are renamed to spares to keep the ABI intact and to have some more space for future extensions. In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to preserve the ABI. It is always set to 0. In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed to preserve the ABI. It is always set to 0. These unused variables in the various structures may be reused in the future or garbage collected before the next release or at some other point when an ABI change happens anyway for other reasons. No MFC is planned. The inflight bandwidth limiter stays disabled by default in the other branches but remains available.
# 8502ec25	26-Aug-2010	Andre Oppermann <andre@FreeBSD.org>	Use timestamp modulo comparison macro for automatic receive buffer scaling to correctly handle wrapping of ticks value. MFC after: 1 week
# b7d747ec	18-Aug-2010	Andre Oppermann <andre@FreeBSD.org>	Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug sysctl's and remove any side effects. Both sysctl's share the same backend infrastructure and due to the way it was implemented enabling net.inet.tcp.log_in_vain would also cause log_debug output to be generated. This was surprising and eventually annoying to the user. The log output backend is kept the same but a little shim is inserted to properly separate log_in_vain and log_debug and to remove any side effects. PR: kern/137317 MFC after: 1 week
# 4fc9f6b8	01-Jun-2010	Robert Watson <rwatson@FreeBSD.org>	Merge r204806 from head to stable/8: Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE() to match other pcbinfo locking macros. Approved by: re (bz)
# 480d7c6c	06-May-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r207369: MFP4: @176978-176982, 176984, 176990-176994, 177441 "Whitespace" churn after the VIMAGE/VNET whirls. Remove the need for some "init" functions within the network stack, like pim6_init(), icmp_init() or significantly shorten others like ip6_init() and nd6_init(), using static initialization again where possible and formerly missed. Move (most) variables back to the place they used to be before the container structs and VIMAGE_GLOBALS (before r185088) and try to reduce the diff to stable/7 and earlier as much as possible, to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9. This also removes some header file pollution for putatively static global variables. Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are no longer needed. Reviewed by: jhb Discussed with: rwatson Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH
# 82cea7e6	29-Apr-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFP4: @176978-176982, 176984, 176990-176994, 177441 "Whitespace" churn after the VIMAGE/VNET whirls. Remove the need for some "init" functions within the network stack, like pim6_init(), icmp_init() or significantly shorten others like ip6_init() and nd6_init(), using static initialization again where possible and formerly missed. Move (most) variables back to the place they used to be before the container structs and VIMAGE_GLOBALS (before r185088) and try to reduce the diff to stable/7 and earlier as much as possible, to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9. This also removes some header file pollution for putatively static global variables. Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are no longer needed. Reviewed by: jhb Discussed with: rwatson Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH MFC after: 6 days
# 55f05ae7	17-Apr-2010	Rui Paulo <rpaulo@FreeBSD.org>	MFC r206456: Honor the CE bit even when the CWR bit is set. PR: 145600 Submitted by: Richard Scheffenegger <rs at netapp.com>
# 9c251892	09-Apr-2010	Rui Paulo <rpaulo@FreeBSD.org>	Honor the CE bit even when the CWR bit is set. PR: 145600 Submitted by: Richard Scheffenegger <rs at netapp.com> MFC after: 1 week
# 66f80e90	06-Mar-2010	Robert Watson <rwatson@FreeBSD.org>	Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE() to match other pcbinfo locking macros. MFC after: 1 week
# 3e5cbaa4	09-Oct-2009	Robert Watson <rwatson@FreeBSD.org>	Merge r197814 from head to stable/8: Remove tcp_input lock statistics; these are intended for debugging only and are not intended to ship in 8.0 as they dirty additional cache lines in a performance-critical per-packet path. Approved by: re (kib, bz)
# f41dd6dc	08-Oct-2009	Robert Watson <rwatson@FreeBSD.org>	Merge r197795 from head to stable/8: In tcp_input(), we acquire a global write lock at first only if a segment is likely to trigger a TCP state change (i.e., FIN/RST/SYN). If we later have to upgrade the lock, we acquire an inpcb reference and drop both global/inpcb locks before reacquiring in-order. In that gap, the connection may transition into TIMEWAIT, so we need to loop back and reevaluate the inpcb after relocking. Reported by: Kamigishi Rei <spambox at haruhiism.net> Reviewed by: bz Approved by: re (kib)
# f681a5fd	06-Oct-2009	Robert Watson <rwatson@FreeBSD.org>	Remove tcp_input lock statistics; these are intended for debugging only and are not intended to ship in 8.0 as they dirty additional cache lines in a performance-critical per-packet path. MFC after: 3 days
# 883e9bc4	05-Oct-2009	Robert Watson <rwatson@FreeBSD.org>	In tcp_input(), we acquire a global write lock at first only if a segment is likely to trigger a TCP state change (i.e., FIN/RST/SYN). If we later have to upgrade the lock, we acquire an inpcb reference and drop both global/inpcb locks before reacquiring in-order. In that gap, the connection may transition into TIMEWAIT, so we need to loop back and reevaluate the inpcb after relocking. MFC after: 3 days Reported by: Kamigishi Rei <spambox at haruhiism.net> Reviewed by: bz
# 315e3e38	02-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Many network stack subsystems use a single global data structure to hold all pertinent statistics for the subsystem. These structures are sometimes "borrowed" by kernel modules that require a place to store statistics for similar events. Add KPI accessor functions for statistics structures referenced by kernel modules so that they no longer encode certain specifics of how the data structures are named and stored. This change is intended to make it easier to move to per-CPU network stats following 8.0-RELEASE. The following modules are affected by this change: if_bridge if_cxgb if_gif ip_mroute ipdivert pf In practice, most of these statistics consumers should, in fact, maintain their own statistics data structures rather than borrowing structures from the base network stack. However, that change is too aggressive for this point in the release cycle. Reviewed by: bz Approved by: re (kib)
# 530c0060	01-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# 7973fba3	28-Jul-2009	Julian Elischer <julian@FreeBSD.org>	Somewhere along the line accept sockets stopped honoring the FIB selected for them. Fix this. Reviewed by: ambrisko Approved by: re (kib) MFC after: 3 days
# eddfbb76	14-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing them in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING. 
Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
# 8c0fec80	23-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Modify most routines returning 'struct ifaddr *' to return references rather than pointers, requiring callers to properly dispose of those references. The following routines now return references: ifaddr_byindex ifa_ifwithaddr ifa_ifwithbroadaddr ifa_ifwithdstaddr ifa_ifwithnet ifaof_ifpforaddr ifa_ifwithroute ifa_ifwithroute_fib rt_getifa rt_getifa_fib IFP_TO_IA ip_rtaddr in6_ifawithifp in6ifa_ifpforlinklocal in6ifa_ifpwithaddr in6_ifadd carp_iamatch6 ip6_getdstifaddr Remove unused macro which didn't have required referencing: IFP_TO_IA6 This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references. Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed. Reviewed by: bz Obtained from: Apple, Inc. (portions) MFC after: 6 weeks (portions)
# 6dfb8b31	16-Jun-2009	John Baldwin <jhb@FreeBSD.org>	Fix edge cases with ticks wrapping from INT_MAX to INT_MIN in the handling of the per-tcpcb t_badtrxtwin. Submitted by: bde
# 1a0e7cfc	11-Jun-2009	John Baldwin <jhb@FreeBSD.org>	Trim extra ()'s. Submitted by: bde
# 0e8cc7e7	10-Jun-2009	John Baldwin <jhb@FreeBSD.org>	Change a few members of tcpcb that store cached copies of ticks to be ints instead of unsigned longs. This fixes a few overflow edge cases on 64-bit platforms. Specifically, if an idle connection receives a packet shortly before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep alive timer fires after 2^31 clock ticks, the keep alive timer will think that the connection has been idle for a very long time and will immediately drop the connection instead of sending a keep alive probe. Reviewed by: silby, gnn, lstewart MFC after: 1 week
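The effect of storing a cached copy of ticks in a wider unsigned type, as described in 0e8cc7e7, can be seen in a few lines. This is a simplified sketch with hypothetical function names. It assumes ticks is a signed 32-bit int that wraps from INT_MAX to INT_MIN, and it models the old behaviour as a subtraction carried out in unsigned long, which is an assumption about the mechanism rather than a copy of the kernel code.

```c
#include <assert.h>
#include <limits.h>

/*
 * Cached copy kept as an int: the difference is taken modulo 2^32 and
 * stays small across the wrap from INT_MAX to INT_MIN.
 */
static int
idle_ticks(int now, int t_rcvtime)
{
	return ((int)((unsigned int)now - (unsigned int)t_rcvtime));
}

/*
 * Cached copy kept as an unsigned long: once "now" has gone negative it
 * is sign-extended before the subtraction, so on a 64-bit platform the
 * connection appears to have been idle for an enormous time.
 */
static unsigned long
idle_ticks_wide(int now, unsigned long t_rcvtime)
{
	return ((unsigned long)now - t_rcvtime);
}
```

For a packet received 5 ticks before the wrap and a timer firing 5 ticks after it, the int version reports 10 idle ticks, while the wide version reports a value above 2^32 on a platform with a 64-bit unsigned long.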
# bcf11e8d	05-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# f93bfb23	02-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Add internal 'mac_policy_count' counter to the MAC Framework, which is a count of the number of registered policies. Rather than unconditionally locking sockets before passing them into MAC, lock them in the MAC entry points only if mac_policy_count is non-zero. This avoids locking overhead for a number of socket system calls when no policies are registered, eliminating measurable overhead for the MAC Framework for the socket subsystem when there are no active policies. Possibly socket locks should be acquired by policies if they are required for socket labels, which would further avoid locking overhead when there are policies but they don't require labeling of sockets, or possibly don't even implement socket controls. Obtained from: TrustedBSD Project
# 81ad7eb0	27-May-2009	Zachary Loafman <zml@FreeBSD.org>	Correct handling of SYN packets that are to the left of the current window of an ESTABLISHED connection. Reviewed by: net@, gnn Approved by: dfr (mentor)
# 78b50714	11-Apr-2009	Robert Watson <rwatson@FreeBSD.org>	Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and TCPSTAT_INC(), rather than directly manipulating the fields across the kernel. This will make it easier to change the implementation of these statistics, such as using per-CPU versions of the data structures. MFC after: 3 days
# 80cb9f21	10-Apr-2009	Kip Macy <kmacy@FreeBSD.org>	Import "flowid" support for serializing flows across transmit queues Reviewed by: rwatson and jeli
# ad71fe3c	15-Mar-2009	Robert Watson <rwatson@FreeBSD.org>	Correct a number of evolved problems with inp_vflag and inp_flags: certain flags that should have been in inp_flags ended up in inp_vflag, meaning that they were inconsistently locked, and in one case, interpreted. Move the following flags from inp_vflag to gaps in the inp_flags space (and clean up the inp_flags constants to make gaps more obvious to future takers): INP_TIMEWAIT INP_SOCKREF INP_ONESBCAST INP_DROPPED Some aspects of this change have no effect on kernel ABI at all, as these are UDP/TCP/IP-internal uses; however, netstat and sockstat detect INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this into account. MFC after: 1 week (or after dependencies are MFC'd) Reviewed by: bz
# 24cb0f22	14-Jan-2009	Lawrence Stewart <lstewart@FreeBSD.org>	Add TCP Appropriate Byte Counting (RFC 3465) support to kernel. The new behaviour is on by default, and can be disabled by setting the net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour. The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled. Reviewed by: rpaulo, gnn Approved by: gnn, kmacy (mentors) Sponsored by: FreeBSD Foundation
# dcdb4371	16-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Use inc_flags instead of the inc_isipv6 alias which so far had been the only flag with random usage patterns. Switch inc_flags to be used as a real bit field by using INC_ISIPV6 with bitops to check for the 'isipv6' condition. While here fix a place or two where in case of v4 inc_flags were not properly initialized before.[1] Found by: rwatson during review [1] Discussed with: rwatson Reviewed by: rwatson MFC after: 4 weeks
# d15fb965	09-Dec-2008	Robert Watson <rwatson@FreeBSD.org>	Enhance one comment relating to recent TCP locking changes, and fix a typo in another. MFC after: 6 weeks
# 252ca428	08-Dec-2008	Robert Watson <rwatson@FreeBSD.org>	Move from solely write-locking the global tcbinfo in tcp_input() to read-locking in the TCP input path, allowing greater TCP input parallelism where multiple ithreads or ithread and netisr are able to run in parallel. Previously, most TCP input paths held a write lock on the global tcbinfo lock, effectively serializing TCP input. Before looking up the connection, acquire a write lock if a potentially state-changing flag is set on the TCP segment header (FIN, RST, SYN), and otherwise a read lock. We may later have to upgrade to a write lock in certain cases (ACKs received by the syncache or during TIMEWAIT) in order to support global state transitions, but this is never required for steady-state packets. Upgrading from a read lock to a write lock must be done as a trylock operation to avoid deadlocks, and actually violates the lock order as the tcbinfo lock precedes the inpcb lock held at the time of upgrade. If the trylock fails, we bump the refcount on the inpcb, drop both locks, and re-acquire in-order. If another thread has freed the connection while the locks are dropped, we free the inpcb and repeat the lookup (this should hardly ever or never happen in practice). For now, maintain a number of new counters measuring how many times various cases execute, and in particular whether various optimistic assumptions about when read locks can be used, whether upgrades are done using the fast path, and whether connections close in practice in the above-described race, actually occur. MFC after: 6 weeks Discussed with: kmacy Reviewed by: bz, gnn, kmacy Tested by: kmacy
# 4b79449e	02-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Rather than using hidden includes (with circular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
# 97021c24	26-Nov-2008	Marko Zec <zec@FreeBSD.org>	Merge more of currently non-functional (i.e. resolving to whitespace) macros from p4/vimage branch. Do a better job at enclosing all instantiations of globals scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks. De-virtualize and mark as const saorder_state_alive and saorder_state_any arrays from ipsec code, given that they are never updated at runtime, so virtualizing them would be pointless. Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 44e33a07	19-Nov-2008	Marko Zec <zec@FreeBSD.org>	Change the initialization methodology for global variables scheduled for virtualization. Instead of initializing the affected global variables at instantiation, assign initial values to them in initializer functions. As a rule, initialization at instantiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentially, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methodology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to initialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 8e5c87f4	06-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix typo and while here another one. Reviewed by: keramida Reported by: keramida MFC after: 2 months (with r184720)
# 91d6cfa6	06-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix a bug introduced with r182851 splitting tcp_mss() into tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could re-use the same code. Move the TSO logic back to tcp_mss() and out of tcp_mss_update(). We tried to avoid that initially but if we are called from tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb there, called into tcp_mtudisc() and tcp_mss_update() which then would reenable TSO on the tcpcb based on TSO capabilities of the interface as learnt in tcp_maxmtu/6(). So if TSO was enabled on the (possibly new) outgoing interface it was turned back on, which led to an endless loop between tcp_output() and tcp_mtudisc() until we overflowed the stack. Reported by: kmacy MFC after: 2 months (along with r182851)
# 6f01cac6	05-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix a bug introduced with r182851 splitting tcp_mss() into tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could re-use the same code. In case we return early and got a metricptr to pass the hostcache info back to the caller we need to initialize the data to a defined state (zero it) as tcp_hc_get() would do if there was no hit. Without that the caller would check on random stack garbage which could lead to undefined results. This only affected tcp_mss() if there was no routing entry for the peer, tcp_mtudisc() was not affected. MFC after: 2 months (along with r182851)
# dd8ac7f9	26-Oct-2008	Robert Watson <rwatson@FreeBSD.org>	In both dropwithreset paths in tcp_input.c, drop the tcbinfo lock sooner to decomplicate locking and eliminate the need for a rather chatty comment about why we have to handle the global lock in a special way for the benefit of ipfw and pf cred rules. MFC after: 3 days
# 4c95fd23	26-Oct-2008	Robert Watson <rwatson@FreeBSD.org>	Remove endearing but syntactically unnecessary "return;" statements directly before the final closing brackets of some TCP functions. MFC after: 3 days
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 6c8286e4	07-Oct-2008	Robert Watson <rwatson@FreeBSD.org>	Don't pass curthread to sbreserve_locked() in tcp_do_segment(), as the netisr or ithread's socket buffer size limit is not the right limit to use. Instead, pass NULL as the other two calls to sbreserve_locked() in the TCP input path (tcp_mss()) do. In practice, this is a no-op, as ithreads and the netisr run without a process limit on socket buffer use, and a NULL thread pointer leads to not using the process's limit, if any. However, if tcp_input() is called in other contexts that do have limits, this may prevent the incorrect limit from being used. MFC after: 3 days
# 8b615593	02-Oct-2008	Marko Zec <zec@FreeBSD.org>	Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparison between pre- and post-change object files(*). (*) netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 014ea782	25-Sep-2008	Robert Watson <rwatson@FreeBSD.org>	As a follow-on to r183323, correct another case where ip_output() was called without an inpcb pointer despite holding the tcbinfo global lock, which led to a deadlock or panic when ipfw tried to further acquire it recursively. Reported by: Stefan Ehmann <shoesoft at gmx dot net> MFC after: 3 days
# a0ca0871	24-Sep-2008	Robert Watson <rwatson@FreeBSD.org>	When dropping a packet and issuing a reset during TCP segment handling, unconditionally drop the tcbinfo lock (after all, we assert it lines before), but call tcp_dropwithreset() under both inpcb and inpcbinfo locks only if we pass in a tcpcb. Otherwise, if the pointer is NULL, firewall code may later recurse the global tcbinfo lock trying to look up an inpcb. This is an instance where a layering violation leads not only potentially to code reentrance and recursion, but also to lock recursion, and was revealed by the conversion to rwlocks because acquiring a read lock on an rwlock already held with a write lock is forbidden. When these locks were mutexes, they simply recursed. Reported by: Stefan Ehmann <shoesoft at gmx dot net> MFC after: 3 days
# c10eb6d1	09-Sep-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Work around an integer division resulting in 0 and thus the congestion window not being incremented, if cwnd > maxseg^2. As suggested in RFC2581 increment the cwnd by 1 in this case. See http://caia.swin.edu.au/reports/080829A/CAIA-TR-080829A.pdf for more details. Submitted by: Alana Huebner, Lawrence Stewart, Grenville Armitage (caia.swin.edu.au) Reviewed by: dwmalone, gnn, rpaulo MFC After: 3 days
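The integer-division problem described in c10eb6d1 can be illustrated with a small sketch. This is not the committed code: the helper name sketch_ca_increment() and the use of unsigned long are assumptions made for illustration, and only the arithmetic (maxseg * maxseg / cwnd truncating to 0 once cwnd > maxseg^2, with a floor of 1 as suggested by RFC2581) comes from the commit message.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a FreeBSD function): per-ACK congestion
 * avoidance increment in bytes. Integer division truncates to 0 once
 * cwnd exceeds maxseg squared, so the result is floored at 1 byte.
 */
static unsigned long
sketch_ca_increment(unsigned long cwnd, unsigned long maxseg)
{
	unsigned long incr;

	assert(cwnd > 0);		/* avoid division by zero */
	incr = maxseg * maxseg / cwnd;
	return (incr > 0 ? incr : 1);
}
```

With maxseg = 1448, a cwnd of ten segments yields an increment of 144 bytes, while any cwnd above 1448 * 1448 bytes would yield 0 without the floor and therefore never grow.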
# 3cee92e0	07-Sep-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Split tcp_mss() into tcp_mss() and tcp_mss_update() where the former calls the latter. Merge tcp_mss_update() with code from tcp_mtudisc() basically doing the same thing. This gives us one central place where we calculate and check mss values to update t_maxopd (maximum mss + options length) instead of two slightly different but almost equal implementations to maintain. PR: kern/118455 Reviewed by: silby (back in March) MFC after: 2 months
# ac957cd2	19-Aug-2008	Julian Elischer <julian@FreeBSD.org>	A bunch of formatting fixes brought to light by, or created by the Vimage commit a few days ago.
# 603724d3	17-Aug-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
# f2512ba1	31-Jul-2008	Rui Paulo <rpaulo@FreeBSD.org>	MFp4 (//depot/projects/tcpecn/): TCP ECN support. Merge of my GSoC 2006 work for NetBSD. TCP ECN is defined in RFC 3168. Partly reviewed by: dwmalone, silby Obtained from: NetBSD
# 8b07e49a	09-May-2008	Julian Elischer <julian@FreeBSD.org>	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x). Currently the only protocol that can make use of the multiple tables is IPv4. Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as an EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in the timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). 
The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extend that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X] when for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). 
These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. If nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility called setfib that acts a bit like nice: setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. 
A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being responded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect with the process that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routed accordingly. ipfw has grown 2 new keywords: setfib N ip from any to any; count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. 
I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. It will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. This will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing to the data. Instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibility of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the destination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco
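The two-dimensional rt_tables[Y][X] layout described in 8b07e49a can be sketched roughly as follows. Everything prefixed with sketch_ or SKETCH_ is hypothetical and exists only for illustration (the struct, the constants, and both lookup helpers); the only facts taken from the commit message are that the row index is the FIB number, that every protocol family except IPv4 always uses row 0, and that the legacy entry points see only the first row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes and family numbers, for illustration only. */
#define SKETCH_MAXFIBS	16
#define SKETCH_AF_MAX	4
#define SKETCH_AF_INET	2
#define SKETCH_AF_OTHER	3

/* Stand-in for a per-family routing table head. */
struct sketch_rtable {
	int id;
};

/* Row = FIB number (Y), column = protocol family (X). */
static struct sketch_rtable *sketch_rt_tables[SKETCH_MAXFIBS][SKETCH_AF_MAX];

/* FIB-aware lookup: only IPv4 honours the FIB, all others use row 0. */
static struct sketch_rtable *
sketch_lookup_table(int fib, int af)
{
	assert(fib >= 0 && fib < SKETCH_MAXFIBS);
	assert(af >= 0 && af < SKETCH_AF_MAX);
	if (af != SKETCH_AF_INET)
		fib = 0;
	return (sketch_rt_tables[fib][af]);
}

/* Legacy-style entry point: unaware of FIBs, always sees the first row. */
static struct sketch_rtable *
sketch_lookup_table_legacy(int af)
{
	return (sketch_lookup_table(0, af));
}
```

Because unaware callers only ever index row 0, the first row behaves like the one-dimensional array that existed before, which is what lets the change remain ABI compatible.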
# 8501a69c	17-Apr-2008	Robert Watson <rwatson@FreeBSD.org>	Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to explicitly select write locking for all use of the inpcb mutex. Update some pcbinfo lock assertions to assert locked rather than write-locked, although in practice almost all uses of the pcbinfo rwlock remain exclusive, and all instances of inpcb lock acquisition are exclusive. This change should introduce (ideally) little functional change. However, it lays the groundwork for significantly increased parallelism in the TCP/IP code. MFC after: 3 months Tested by: kris (superset of committed patch)
# 7a3244cc	06-Apr-2008	Robert Watson <rwatson@FreeBSD.org>	Add further TCP inpcb locking assertions to some TCP input code paths. MFC after: 1 month
# c3b02504	02-Mar-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Some "cleanup" of tcp_mss(): - Move the assignment of the socket down before we first need it. No need to do it at the beginning and then drop out the function by one of the returns before using it 100 lines further down. - Use t_maxopd which was assigned the "tcp_mssdflt" for the correct AF already instead of another #ifdef ? : #endif block doing the same. - Remove an unneeded (duplicate) assignment of mss to t_maxseg just before we possibly change mss and re-do the assignment without using t_maxseg in between. Reviewed by: silby No objections: net@ (silence) MFC after: 5 days
# af92e6cf	01-Mar-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix indentation (whitespace changes only). MFC after: 6 days
# 30d239bc	24-Oct-2007	Robert Watson <rwatson@FreeBSD.org>	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updated as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 4b421e2d	07-Oct-2007	Mike Silbersack <silby@FreeBSD.org>	Add FBSDID to all files in netinet so that people can more easily include file version information in bug reports. Approved by: re (kensmith)
# e31d8aa3	06-Oct-2007	Mike Silbersack <silby@FreeBSD.org>	Improve the debugging message: TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received data after socket was closed, sending RST and removing tcpcb So that it also includes how many bytes of data were received. It now looks like this: TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received X bytes of data after socket was closed, sending RST and removing tcpcb Approved by: re (gnn)
# a2589465	10-Sep-2007	Ken Smith <kensmith@FreeBSD.org>	Make sure that either inp is NULL or we have obtained a lock on it before jumping to dropunlock to avoid a panic. While here move the calls to ipsec4_in_reject() and ipsec6_in_reject() so they are after we obtain the lock on inp. Original patch to avoid panic: pjd Review of locking adjustments: gnn, sam Approved by: re (rwatson)
# 218cbbea	30-Jul-2007	Dag-Erling Smørgrav <des@FreeBSD.org>	Make tcpstates[] static, and make sure TCPSTATES is defined before <netinet/tcp_fsm.h> is included into any compilation unit that needs tcpstates[]. Also remove incorrect extern declarations and TCPDEBUG conditionals. This allows kernels both with and without TCPDEBUG to build, and unbreaks the tinderbox. Approved by: re (rwatson)
# 24face54	28-Jul-2007	Matt Jacob <mjacob@FreeBSD.org>	Fix compilation problems- tcpstates is only available if TCPDEBUG is set. Approved by: re (in spirit)
# 773673c1	27-Jul-2007	Andre Oppermann <andre@FreeBSD.org>	Provide a sysctl to toggle reporting of TCP debug logging: sys.net.inet.tcp.log_debug = 1 It defaults to enabled for the moment and is to be turned off for the next release like other diagnostics from development branches. It is important to note that sysctl sys.net.inet.tcp.log_in_vain uses the same logging function as log_debug. Enabling of the former also causes the latter to engage, but not vice versa. Use consistent terminology in tcp log messages: "ignored" means a segment contains invalid flags/information and is dropped without changing state or issuing a reply. "rejected" means a segment contains invalid flags/information but is causing a reply (usually RST) and may cause a state change. Approved by: re (rwatson)
# 19bc77c5	28-Jul-2007	Andre Oppermann <andre@FreeBSD.org>	o Move all detailed checks for RST in LISTEN state from tcp_input() to syncache_rst(). o Fix tests for flag combinations of RST and SYN, ACK, FIN. Before a RST for a connection in syncache did not properly free the entry. o Add more detailed logging. Approved by: re (rwatson)
# c325962b	26-Jul-2007	Mike Silbersack <silby@FreeBSD.org>	Export the contents of the syncache to netstat. Approved by: re (kensmith) MFC after: 2 weeks
# 564aab1f	25-Jul-2007	Andre Oppermann <andre@FreeBSD.org>	Fix comments in tcp_do_segment(). Approved by: re (kensmith)
# 9fb5d4c0	04-Jul-2007	Peter Wemm <peter@FreeBSD.org>	Fix cast-qualifiers warning when INET6 is not present Approved by: re (rwatson)
# b2630c29	02-Jul-2007	George V. Neville-Neil <gnn@FreeBSD.org>	Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC option is now deprecated, as well as the KAME IPsec code. What was FAST_IPSEC is now IPSEC. Approved by: re Sponsored by: Secure Computing
# 2cb64cb2	01-Jul-2007	George V. Neville-Neil <gnn@FreeBSD.org>	Commit IPv6 support for FAST_IPSEC to the tree. This commit includes only the kernel files, the rest of the files will follow in a second commit. Reviewed by: bz Approved by: re Supported by: Secure Computing
# f194524f	10-Jun-2007	Andre Oppermann <andre@FreeBSD.org>	Fix a case in tcp_do_segment() where tcp_update_sack_list() would be called with an incorrect segment end value. tcp_reass() may trim segments when they overlap with already existing ones in the reassembly queue. Instead of saving the segment end value before the call to tcp_reass() compute it on the fly based on the effective segment length afterwards. This bug was not really problematic as no information got lost and the eventual SACK information computation was correct nonetheless. MFC after: 1 week
# e8949f74	10-Jun-2007	Andre Oppermann <andre@FreeBSD.org>	Fix style for comments, be more verbose and add some more.
# 5396d0f8	09-Jun-2007	Andre Oppermann <andre@FreeBSD.org>	Remove some bogosity from the SYN_SENT case in tcp_do_segment and simplify handling of the send/receive window scaling. No change in effective behaviour. RFC1323 requires the window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself never be scaled. Noticed by: yar
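The RFC1323 rule quoted in 5396d0f8 (the window field of a SYN or SYN,ACK segment is never scaled) could be expressed along these lines. This is a sketch under stated assumptions, not the code from tcp_do_segment(): the function sketch_recv_window() and the SKETCH_TH_SYN constant are hypothetical names, and the upper bound of 14 on the shift count is the limit RFC1323 places on the window scale option.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TH_SYN	0x02	/* SYN bit of the TCP flags byte */

/*
 * Hypothetical helper: turn the 16-bit window field of a received
 * segment into a byte count. Segments carrying SYN are taken as-is;
 * all others are shifted by the negotiated scale factor.
 */
static uint32_t
sketch_recv_window(uint16_t th_win, uint8_t th_flags, uint8_t snd_scale)
{
	assert(snd_scale <= 14);	/* RFC1323 maximum shift count */
	if (th_flags & SKETCH_TH_SYN)
		return (th_win);
	return ((uint32_t)th_win << snd_scale);
}
```

For a window field of 1000 and a scale of 7, a SYN,ACK segment advertises 1000 bytes while a plain ACK advertises 128000 bytes.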
# 8d573cc1	28-May-2007	Andre Oppermann <andre@FreeBSD.org>	Make log messages more verbose and simpler to understand for non-experts. Update comments to be more conscious, verbose and fully reflect reality.
# e885b205	28-May-2007	Andre Oppermann <andre@FreeBSD.org>	Fix indentation of the syncache_expand() section in tcp_input().
# a160e630	28-May-2007	Andre Oppermann <andre@FreeBSD.org>	Refactor and rewrite in parts the SYN handling code on listen sockets in tcp_input(): o tighten the checks on allowed TCP flags to be RFC793 and tcp-secure conform o log check failures to syslog at LOG_DEBUG level o rearrange the code flow to be easier to follow o add KASSERTs to validate assumptions of the code flow Add sysctl net.inet.tcp.syncache.rst_on_sock_fail defaulting to enable that controls the behavior on socket creation failure for an otherwise successful 3-way handshake. The socket creation can fail due to global memory shortage, listen queue limits and file descriptor limits. The sysctl allows choosing between two options to deal with this. One is to send a reset to the other endpoint to notify it about the failure (default). The other one is to ignore and treat the failure as a transient error and have the other endpoint retransmit for another try. Reviewed by: rwatson (in general)
# df541e5f	18-May-2007	Andre Oppermann <andre@FreeBSD.org>	Add tcp_log_addrs() function to generate a standardized TCP log line for use throughout the tcp subsystem. It is IPv4 and IPv6 aware and creates a line in the following format: "TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>" A "\n" is not included at the end. The caller is supposed to add further information after the standard tcp log header. The function returns a NUL terminated string which the caller has to free(s, M_TCPLOG) after use. All memory allocation is done with M_NOWAIT and the return value may be NULL in memory shortage situations. Either struct in_conninfo || (struct tcphdr && (struct ip || struct ip6_hdr)) have to be supplied. Due to ip[6].h header inclusion limitations and ordering issues the struct ip and struct ip6_hdr parameters have to be casted and passed as void * pointers. tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr, void *ip6hdr) Usage example: struct ip *ip; char *tcplog; if ((tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL))) { log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n", tcplog, __func__); free(tcplog, M_TCPLOG); }
# 2104448f	16-May-2007	Andre Oppermann <andre@FreeBSD.org>	Move TIME_WAIT related functions and timer handling from files other than repo copied tcp_subr.c into tcp_timewait.c#1.284: tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck() tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset() tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop() tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan() This is a mechanical move with appropriate renames and making them static if used only locally. The tcp_tw_2msl_scan() cleanup function is still run from the tcp_slowtimo() in tcp_timer.c.
# ec9c7553	13-May-2007	Andre Oppermann <andre@FreeBSD.org>	Complete the (mechanical) move of the TCP reassembly and timewait functions from their original place to their own files. TCP Reassembly from tcp_input.c -> tcp_reass.c TCP Timewait from tcp_subr.c -> tcp_timewait.c
# f2565d68	10-May-2007	Robert Watson <rwatson@FreeBSD.org>	Move universally to ANSI C function declarations, with relatively consistent style(9)-ish layout.
# d30d90dc	09-May-2007	Maxim Konovalov <maxim@FreeBSD.org>	o Fix style(9) bugs introduced in the last commit. Pointed out by: bde
# 10fe523e	09-May-2007	Maxim Konovalov <maxim@FreeBSD.org>	o Unbreak "options TCPDEBUG" && "nooptions INET6" kernel build. PR: kern/112517 Submitted by: vd
# 3529149e	06-May-2007	Andre Oppermann <andre@FreeBSD.org>	Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of a dedicated sack_enable int for this bool. Change all users accordingly.
# 0ca3f933	06-May-2007	Andre Oppermann <andre@FreeBSD.org>	o Remove redundant tcp reassembly check in header prediction code o Rearrange code to make intent in TCPS_SYN_SENT case more clear o Assorted style cleanup o Comment clarification for tcp_dropwithreset()
# c5ad39b9	06-May-2007	Andre Oppermann <andre@FreeBSD.org>	Reorder the TCP header prediction test to check for the most volatile values first to spend less time on a fallback to normal processing.
# 679d9708	06-May-2007	Andre Oppermann <andre@FreeBSD.org>	Remove the defunct remains of the TCPS_TIME_WAIT cases from tcp_do_segment and change it to a void function. We use a compressed structure for TCPS_TIME_WAIT to save memory. Any late segments arriving for such a connection are handled directly in the TW code.
# 1cd6eadf	04-May-2007	Robert Watson <rwatson@FreeBSD.org>	Tweak comment at end of tcp_input() when calling into tcp_do_segment(): the pcbinfo lock will be released as well, not just the pcb lock.
# 9fa198be	23-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	o Fix INP lock leak in the minttl case o Remove indirection in the decision of unlocking inp o Further annotation of locking in tcp_input()
# df47e437	20-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	o Remove unnecessary TOF_SIGLEN flag from struct tcpopt o Correctly set to->to_signature in tcp_dooptions() o Update comments
# 7824d002	20-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Add more KASSERT's.
# 4d6e7130	20-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Remove bogus check for accept queue length and associated failure handling from the incoming SYN handling section of tcp_input(). Enforcement of the accept queue limits is done by sonewconn() after the 3WHS is completed. It is not necessary to have an earlier check before a connection request enters the SYN cache awaiting the full handshake. It rather limits the effectiveness of the syncache by preventing legit and illegit connections from entering it and having them shaken out before we hit the real limit which may have vanished by then. Change return value of syncache_add() to void. No status communication is required.
# e207f800	20-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Simplify syncache_expand() and clarify its semantics. Zero is returned when the ACK is invalid and doesn't belong to any registered connection, either in syncache or through SYN cookies. True but a NULL struct socket is returned when the 3WHS completed but the socket could not be created due to insufficient resources or limits reached. For both cases an RST is sent back in tcp_input(). A logic error leading to a panic is fixed where syncache_expand() would free the mbuf on socket allocation failure but tcp_input() later supplies it to tcp_dropwithreset() to issue a RST to the peer. Reported by: kris (the panic)
# 215c8d75	15-Apr-2007	Robert Watson <rwatson@FreeBSD.org>	Remove unused variable tcbinfo_mtx.
# b8152ba7	11-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Change the TCP timer system from using the callout system five times directly to a merged model where only one callout, the next to fire, is registered. Instead of callout_reset(9) and callout_stop(9) the new function tcp_timer_activate() is used which then internally manages the callout. The single new callout is a mutex callout on inpcb simplifying the locking a bit. tcp_timer() is the called function which handles all race conditions in one place and then dispatches the individual timer functions. Reviewed by: rwatson (earlier version)
# 995a7717	04-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Add INP_INFO_UNLOCK_ASSERT() and use it in tcp_input(). Also add some further INP_INFO_WLOCK_ASSERT() while there.
# 0c38fd0a	04-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Move last tcpcb initialization for the inbound connection case from tcp_input() to syncache_socket() where it belongs and the majority of it already happens. The "tp->snd_up = tp->snd_una" is removed as it is done with the tcp_sendseqinit() macro a few lines earlier.
# 5dd9dfef	04-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Retire unused TCP_SACK_DEBUG.
# b728e902	04-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	In tcp_dooptions() skip over SACK options if it is a SYN segment.
# 1929eae1	27-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	When blackholing do a 'dropunlock' in the new world order to prevent the INP_INFO_LOCK from leaking. Reported by: ache Found by: rwatson
# 14739780	24-Mar-2007	Maxim Konovalov <maxim@FreeBSD.org>	o Use a define for a buffer size. Prodded by: db o Add missed vars for TCPDEBUG in tcp_do_segment(). Prodded by: tinderbox
# 302ce8d6	23-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Split tcp_input() into its two functional parts: o tcp_input() now handles TCP segment sanity checks and preparations including the INPCB lookup and syncache. o tcp_do_segment() handles all data and ACK processing and is IPv4/v6 agnostic. Change all KASSERT() messages to ("%s: ", __func__). The changes in this commit are primarily of mechanical nature and no functional changes besides the function split are made. Discussed with: rwatson
# 4dfdffe9	23-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Tidy up some code to conform better to surroundings and style(9), 0 = NULL and space/tab.
# fc30a251	23-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Bring SACK option handling in tcp_dooptions() in line with all other options and adjust users accordingly.
# ad3f9ab3	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	ANSIfy function declarations and remove register keywords for variables. Consistently apply style to all function declarations.
# b10fbdea	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Tidy up IPFIREWALL_FORWARD sections and comments.
# 794235b7	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Update and clarify comments in first section of tcp_input().
# db33b3e6	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Tidy up the ACCEPTCONN section of tcp_input(), adjust comments and remove old dead T/TCP code.
# 574b6964	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Tidy up tcp_log_in_vain and blackhole.
# 85c49791	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Make TCP_DROP_SYNFIN a standard part of TCP. Disabled by default it doesn't impede normal operation negatively and is only a few lines of code. Its close relatives blackhole and log_in_vain aren't options either.
# e406f5a1	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Remove tcp_minmssoverload DoS detection logic. The problem it tried to protect us from wasn't really there and it only bloats the code. Should the problem surface in the future we can simply resurrect it from cvs history.
# 6489fe65	19-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Match up SYSCTL declaration style.
# 02a1a643	15-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Consolidate insertion of TCP options into a segment from within tcp_output() and syncache_respond() into its own generic function tcp_addoptions(). tcp_addoptions() is alignment agnostic and does optimal packing in all cases. In struct tcpopt rename to_requested_s_scale to just to_wscale. Add a comment with quote from RFC1323: "The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself is never scaled." Reviewed by: silby, mohans, julian Sponsored by: TCP/IP Optimization Fundraise 2005
# 95ad8418	07-Mar-2007	Qing Li <qingli@FreeBSD.org>	This patch is provided to fix a couple of deployment issues observed in the field. In one situation, one end of the TCP connection sends a back-to-back RST packet; with delayed ack, the last_ack_sent variable has not been updated yet. When tcp_insecure_rst is turned off, the code treats the RST as invalid because last_ack_sent instead of rcv_nxt is compared against th_seq. Apparently there is some kind of firewall that sits in between the two ends and that RST packet is the only RST packet received. With short lived HTTP connections, the symptom is a large accumulation of connections over a short period of time. The +/-(1) factor is to take care of implementations out there that generate RST packets with these types of sequence numbers. This behavior has also been observed in live environments. Reviewed by: silby, Mike Karels MFC after: 1 week
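The relaxed RST check described in this entry might be sketched as follows. This is an illustration under assumptions, not the committed code: `seq_within_one` and `rst_seq_ok` are hypothetical helpers, and the sketch reads the "+/-(1) factor" as accepting a RST whose sequence number is within one of either `last_ack_sent` or `rcv_nxt`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a and b differ by at most 1 in 32-bit
 * TCP sequence space. The signed cast makes the test wrap-safe. */
static bool
seq_within_one(uint32_t a, uint32_t b)
{
	int32_t diff = (int32_t)(a - b);

	return (diff >= -1 && diff <= 1);
}

/* Hypothetical strict-mode check: accept the RST if th_seq is close to
 * either the last ACK we sent or the next byte we expect. Comparing
 * against rcv_nxt covers the delayed-ACK case, where last_ack_sent
 * lags behind rcv_nxt. */
static bool
rst_seq_ok(uint32_t th_seq, uint32_t last_ack_sent, uint32_t rcv_nxt)
{
	return (seq_within_one(th_seq, last_ack_sent) ||
	    seq_within_one(th_seq, rcv_nxt));
}
```

With a pending delayed ACK (`last_ack_sent` = 900, `rcv_nxt` = 1000), a RST at sequence 1000 passes this check even though it does not match `last_ack_sent`.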
# 4a32dc29	28-Feb-2007	Mohan Srinivasan <mohans@FreeBSD.org>	In the SYN_SENT case, Initialize the snd_wnd before the call to tcp_mss(). The TCP hostcache logic in tcp_mss() depends on the snd_wnd being initialized.
# 7c72af87	26-Feb-2007	Mohan Srinivasan <mohans@FreeBSD.org>	Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigates potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default. Reviewed by: gnn, silby.
# afdb4274	20-Feb-2007	Robert Watson <rwatson@FreeBSD.org>	Rename two identically named log_in_vain variables: tcp_input.c's static log_in_vain to tcp_log_in_vain, and udp_usrreq's global log_in_vain to udp_log_in_vain. MFC after: 1 week
# 6741ecf5	01-Feb-2007	Andre Oppermann <andre@FreeBSD.org>	Auto sizing TCP socket buffers. Normally the socket buffers are static (either derived from global defaults or set with setsockopt) and do not adapt to real network conditions. Two things happen: a) your socket buffers are too small and you can't reach the full potential of the network between both hosts; b) your socket buffers are too big and you waste a lot of kernel memory for data just sitting around. With automatic TCP send and receive socket buffers we can start with a small buffer and quickly grow it in parallel with the TCP congestion window to match real network conditions. FreeBSD has a default 32K send socket buffer. This supports a maximal transfer rate of only slightly more than 2Mbit/s on a 100ms RTT trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send buffer auto scaling and the default values below it supports 20Mbit/s at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or 1000%. For the receive side it looks slightly better with a default of 64K buffer size. New sysctls are: net.inet.tcp.sendbuf_auto=1 (enabled) net.inet.tcp.sendbuf_inc=8192 (8K, step size) net.inet.tcp.sendbuf_max=262144 (256K, growth limit) net.inet.tcp.recvbuf_auto=1 (enabled) net.inet.tcp.recvbuf_inc=16384 (16K, step size) net.inet.tcp.recvbuf_max=262144 (256K, growth limit) Tested by: many (on HEAD and RELENG_6) Approved by: re MFC after: 1 month
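The throughput figures quoted in this entry follow from the fact that at most one send buffer's worth of data can be in flight per round trip. A minimal sketch of that arithmetic, with `max_rate_bps` as a hypothetical helper name (the buffer sizes and RTTs are the ones from the commit message):

```c
#include <stdint.h>

/* Upper bound on the transfer rate, in bits per second, for a connection
 * whose in-flight data is capped by a socket buffer of buf_bytes and whose
 * round-trip time is rtt_ms milliseconds. rtt_ms must be non-zero. */
static uint64_t
max_rate_bps(uint64_t buf_bytes, uint64_t rtt_ms)
{
	return (buf_bytes * 8 * 1000 / rtt_ms);
}
```

For the 32K default, `max_rate_bps(32768, 100)` is 2621440 (about 2.6 Mbit/s) and `max_rate_bps(32768, 200)` is 1310720. At the 256K growth limit, `max_rate_bps(262144, 100)` is 20971520 (about 20 Mbit/s) and `max_rate_bps(262144, 200)` is 10485760.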
# 1d54aa3b	11-Dec-2006	Bjoern A. Zeeb <bz@FreeBSD.org>	MFp4: 92972, 98913 + one more change In ip6_sprintf no longer use and return one of eight static buffers for printing/logging ipv6 addresses. The caller now has to hand in a sufficiently large buffer as first argument.
# aed55708	22-Oct-2006	Robert Watson <rwatson@FreeBSD.org>	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# e16fa5ca	25-Sep-2006	John-Mark Gurney <jmg@FreeBSD.org>	fix calculating to_tsecr... This prevents the rtt calculations from going all wonky...
# f1edc3bd	23-Sep-2006	Bruce M Simpson <bms@FreeBSD.org>	Always set the IP version in the TCP input path, to preserve the header field for possible later IPSEC SPD lookup, even when the kernel is built without 'options INET6'. PR: kern/57760 MFC after: 1 week Submitted by: Joachim Schueth
# bf6d304a	13-Sep-2006	Andre Oppermann <andre@FreeBSD.org>	Rewrite of TCP syncookies to remove locking requirements and to enhance functionality: - Remove a rwlock acquisition/release per generated syncookie. Locking is now integrated with the bucket row locking of syncache itself and syncookies no longer add any additional lock overhead. - Syncookie secrets are different for and stored per syncache bucket row. Secrets expire after 16 seconds and are reseeded on-demand. - The computational overhead for syncookie generation and verification is one MD5 hash computation as before. - Syncache can be turned off and run with syncookies only by setting the sysctl net.inet.tcp.syncookies_only=1. This implementation extends the original idea and first implementation of FreeBSD by using not only the initial sequence number field to store information but also the timestamp field if present. This way we can keep track of the entire state we need to know to recreate the session in its original form. Almost all TCP speakers implement RFC1323 timestamps these days. For those that do not we still have to live with the known shortcomings of the ISN only SYN cookies. The use of the timestamp field causes the timestamps to be randomized if syncookies are enabled. The idea of SYN cookies is to encode and include all necessary information about the connection setup state within the SYN-ACK we send back and thus to get along without keeping any local state until the ACK to the SYN-ACK arrives (if ever). Everything we need to know should be available from the information we encoded in the SYN-ACK. A detailed description of the inner working of the syncookies mechanism is included in the comments in tcp_syncache.c. Reviewed by: silby (slightly earlier version) Sponsored by: TCP/IP Optimization Fundraise 2005
# 751dea29	07-Sep-2006	Ruslan Ermilov <ru@FreeBSD.org>	Back when we had T/TCP support, we used to apply different timeouts for TCP and T/TCP connections in the TIME_WAIT state, and we had two separate timed wait queues for them. Now that it has gone, the timeout is always 2*MSL again, and there is no reason to keep two queues (the first was unused anyway!). Also, reimplement the remaining queue using a TAILQ (it was technically impossible before, with two queues).
# 233dcce1	06-Sep-2006	Andre Oppermann <andre@FreeBSD.org>	First step of TSO (TCP segmentation offload) support in our network stack. o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6 o add CSUM_TSO flag to mbuf pkthdr csum_flags field o add tso_segsz field to mbuf pkthdr o enhance ip_output() packet length check to allow for large TSO packets o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities o adjust all callers of tcp_maxmtu[46]() accordingly Discussed on: -current, -net Sponsored by: TCP/IP Optimization Fundraise 2005
# 464469c7	11-Aug-2006	Mohan Srinivasan <mohans@FreeBSD.org>	Fixes an edge case bug in timewait handling where ticks rolling over causing the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry). Reviewed by: silby
# 421d8aa6	29-Jun-2006	Bjoern A. Zeeb <bz@FreeBSD.org>	Use INPLOOKUP_WILDCARD instead of just 1 more consistently. OKed by: rwatson (some weeks ago)
# 8bfb1918	26-Jun-2006	Andre Oppermann <andre@FreeBSD.org>	Some cleanups and janitorial work to tcp_syncache: o don't assign remote/local host/port information manually between provided struct in_conninfo and struct syncache, bcopy() it instead o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture the purpose of this field o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons o fix IPSEC error case printf's to report correct function name o in syncache_socket() only transpose enhanced tcp options parameters to struct tcpcb when the inpcb doesn't have TF_NOOPT set o in syncache_respond() reorder stack variables o in syncache_respond() remove bogus KASSERT() No functional changes. Sponsored by: TCP/IP Optimization Fundraise 2005
# f72167f4	26-Jun-2006	Andre Oppermann <andre@FreeBSD.org>	Some cleanups and janitorial work to tcp_dooptions(): o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its usage accordingly o update the comments to the tcp_dooptions() invocation in tcp_input():after_listen to reflect reality o move the logic checking the echoed timestamp out of tcp_dooptions() to the only place that uses it next to the invocation described in the previous item o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others o add comments in to struct tcpopt.to_flags #defines No functional changes. Sponsored by: TCP/IP Optimization Fundraise 2005
# 5e1aa279	18-Jun-2006	David Malone <dwmalone@FreeBSD.org>	When we receive an out-of-window SYN for an "ESTABLISHED" connection, ACK the SYN as required by RFC793, rather than ignoring it. NetBSD have had a similar change since 1999. PR: 93236 Submitted by: Grant Edwards <grante@visi.com> MFC after: 1 month
# 351630c4	17-Jun-2006	Andre Oppermann <andre@FreeBSD.org>	Add locking to TCP syncache and drop the global tcpinfo lock as early as possible for the syncache_add() case. The syncache timer no longer acquires the tcpinfo lock and timeout/retransmit runs can happen in parallel with bucket granularity. On a P4 the additional locks cause a slight regression of 0.7% in tcp connections per second. When IP and TCP input are deserialized and can run in parallel this little overhead can be neglected. The syncookie handling still leaves room for improvement and its random salts may be moved to the syncache bucket head structures to remove the second lock operation currently required for it. However this would be a more involved change from the way syncookies work at the moment. Reviewed by: rwatson Tested by: rwatson, ps (earlier version) Sponsored by: TCP/IP Optimization Fundraise 2005
# 4f590175	21-Apr-2006	Paul Saab <ps@FreeBSD.org>	Allow for nmbclusters and maxsockets to be increased via sysctl. An eventhandler is used to update all the various zones that depend on these values.
# 3cbe7faf	09-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Modify tcp_timewait() to accept an inpcb reference, not a tcptw reference. For now, we allow the possibility that the in_ppcb pointer in the inpcb may be NULL if a timewait socket has had its tcptw structure recycled. This allows tcp_timewait() to consistently unlock the inpcb. Reported by: Kazuaki Oda <kaakun at highway dot ne dot jp> MFC after: 3 months
# a460ae4b	05-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Don't unlock a timewait structure if the pointer is NULL in tcp_timewait(). This corrects a bug (or lack of fixing of a bug) in tcp_input.c:1.295. Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp> MFC after: 3 months
# ae0e7143	03-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being NULL. We currently do allow this to happen, but may want to remove that possibility in the future. This case can occur when a socket is left open after TCP wraps up, and the timewait state is recycled. This will be cleaned up in the future. Found by: Kazuaki Oda <kaakun at highway dot ne dot jp> MFC after: 3 months
# afa39e25	03-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Change inp_ppcb from caddr_t to void *, fix/remove associated related casts. Consistently use intotw() to cast inp_ppcb pointers to struct tcptw pointers. Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb * pointers. Don't assign tp to the result of intotcpcb() during variable declaration at the top of functions, as that is before the asserts relating to locking have been performed. Do this later in the function after appropriate assertions have run to allow that operation to be considered safe. MFC after: 3 months
# 623dce13	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Update TCP for infrastructural changes to the socket/pcb refcount model, pru_abort(), pru_detach(), and in_pcbdetach(): - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code. - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, the receive code no longer requires the pcbinfo lock, and the send code only requires it if building a new connection on an otherwise unconnected socket triggered via sendto() with an address. This should significantly reduce tcbinfo lock contention in the receive and send cases. - In order to support the invariant that so_pcb != NULL, it is now necessary for the TCP code to not discard the tcpcb any time a connection is dropped, but instead leave the tcpcb until the socket is shutdown. This case is handled by setting INP_DROPPED, to substitute for using a NULL so_pcb to indicate that the connection has been dropped. This requires the inpcb lock, but not the pcbinfo lock. - Unlike all other protocols in the tree, TCP may need to retain access to the socket after the file descriptor has been closed. Set SS_PROTOREF in tcp_detach() in order to prevent the socket from being freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether or not it needs to free the socket when the connection finally does close. The typical case where this occurs is if close() is called on a TCP socket before all sent data in the send socket buffer has been transmitted or acknowledged. If INP_SOCKREF is found when the connection is dropped, we release the inpcb, tcpcb, and socket instead of flagging INP_DROPPED. - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code, in which timers are stopped but not drained when the socket is freed, as waiting for drain may lead to deadlocks, or have to occur in a context where waiting is not permitted. This race has been handled by testing to see if the tcpcb pointer in the inpcb is NULL (and vice versa), which is not normally permitted, but may be true if an inpcb and tcpcb have been freed. Add a counter to test how often this race has actually occurred, and a large comment for each instance where we compare potentially freed memory with NULL. This will have to be fixed in the near future, but requires us to further address how to handle the timer shutdown issue. - Several TCP calls no longer potentially free the passed inpcb/tcpcb, so no longer need to return a pointer to indicate whether the argument passed in is still valid. - Un-macroize debugging and locking setup for various protocol switch methods for TCP, as it led to more obscurity, and as locking becomes more customized to the methods, offers less benefit. - Assert copyright on tcp_usrreq.c due to significant modifications that have been made as part of this work. These changes significantly modify the memory management and connection logic of our TCP implementation, and are (as such) High Risk Changes, and likely to contain serious bugs. Please report problems to the current@ mailing list ASAP, ideally with simple test cases, and optionally, packet traces. MFC after: 3 months
# 1c53f806	25-Mar-2006	Robert Watson <rwatson@FreeBSD.org>	Explicitly assert socket pointer is non-NULL in tcp_input() so as to provide better debugging information. Prefer explicit comparison to NULL for tcpcb pointers rather than treating them as booleans. MFC after: 1 month
# 464fcfbc	28-Feb-2006	Andre Oppermann <andre@FreeBSD.org>	Rework TCP window scaling (RFC1323) to properly scale the send window right from the beginning and partly clean up the differences in handling between SYN_SENT and SYN_RCVD (syncache). Further changes to this code to come. This is a first incremental step to a general overhaul and streamlining of the TCP code. PR: kern/15095 PR: kern/92690 (partly) Reviewed by: qingli (and tested with ANVL) Sponsored by: TCP/IP Optimization Fundraise 2005
# 4b8e98d6	23-Feb-2006	Qing Li <qingli@FreeBSD.org>	This patch fixes the problem where the current TCP code can not handle simultaneous open. Both the bug and the patch were verified using the ANVL test suite. PR: kern/74935 Submitted by: qingli (before I became committer) Reviewed by: andre MFC after: 5 days
# 8e8aab7a	18-Feb-2006	Andre Oppermann <andre@FreeBSD.org>	Remove unneeded includes and provide more accurate description to others. Submitted by: garys PR: kern/86437
# eaf80179	16-Feb-2006	Andre Oppermann <andre@FreeBSD.org>	Have TCP Inflight disable itself if the RTT is below a certain threshold. Inflight doesn't make sense on a LAN as it has trouble figuring out the maximal bandwidth because of the coarse tick granularity. The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold in milliseconds below which inflight will disengage. It defaults to 10ms. Tested by: Joao Barros <joao.barros-at-gmail.com>, Rich Murphey <rich-at-whiteoaklabs.com> Sponsored by: TCP/IP Optimization Fundraise 2005
# 02707462	18-Jan-2006	Andre Oppermann <andre@FreeBSD.org>	Do not dereference the ip header pointer in the IPv6 case. This fixes a bug in the previous commit. Found by: Coverity Prevent(tm) Coverity ID: CID253 Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 days
# 34f83c52	14-Jan-2006	George V. Neville-Neil <gnn@FreeBSD.org>	Check the correct TTL in both the IPv6 and IPv4 cases. Submitted by: glebius Reviewed by: gnn, bz Found with: Coverity Prevent(tm)
# ef39adf0	18-Nov-2005	Andre Oppermann <andre@FreeBSD.org>	Consolidate all IP Options handling functions into ip_options.[ch] and include ip_options.h into all files making use of IP Options functions. From ip_input.c rev 1.306: ip_dooptions(struct mbuf *m, int pass) save_rte(m, option, dst) ip_srcroute(m0) ip_stripoptions(m, mopt) From ip_output.c rev 1.249: ip_insertoptions(m, opt, phlen) ip_optcopy(ip, jp) ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m) No functional changes in this commit. Discussed with: rwatson Sponsored by: TCP/IP Optimization Fundraise 2005
# a65e12b0	19-Oct-2005	Robert Watson <rwatson@FreeBSD.org>	Convert if (tp->t_state == TCPS_LISTEN) panic() into a KASSERT. MFC after: 2 weeks
# 4d3b1346	23-Aug-2005	Paul Saab <ps@FreeBSD.org>	Remove a KASSERT in the sack path that fails because of an interaction between sack and a bug in the "bad retransmit recovery" logic. This is a workaround, the underlying bug will be fixed later. Submitted by: Mohan Srinivasan, Noritoshi Demizu
# 936cd18d	22-Aug-2005	Andre Oppermann <andre@FreeBSD.org>	Add socketoption IP_MINTTL. May be used to set the minimum acceptable TTL a packet must have when received on a socket. All packets with a lower TTL are silently dropped. Works on already connected/connecting and listening sockets for RAW/UDP/TCP. This option is only really useful when set to 255 preventing packets from outside the directly connected networks reaching local listeners on sockets. Allows userland implementation of 'The Generalized TTL Security Mechanism (GTSM)' according to RFC3682. Examples of such use include the Cisco IOS BGP implementation command "neighbor ttl-security". MFC after: 2 weeks Sponsored by: TCP/IP Optimization Fundraise 2005
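A userland program could use the option described in this entry roughly as follows. This is a sketch, assuming a system whose `<netinet/in.h>` defines `IP_MINTTL`; `set_min_ttl` is a hypothetical wrapper name, not part of any API.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Ask the stack to silently drop packets arriving on fd with a TTL
 * below minttl. Returns 0 on success and -1 on error, as setsockopt()
 * does. The option takes an int, per the socket option convention. */
static int
set_min_ttl(int fd, int minttl)
{
	return (setsockopt(fd, IPPROTO_IP, IP_MINTTL, &minttl,
	    sizeof(minttl)));
}
```

For a GTSM-style setup, the receiver would call `set_min_ttl(fd, 255)` while the directly connected peer sends with a TTL of 255, so any packet that has crossed a router is dropped before it reaches the listener.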
# d7587117	05-Jul-2005	Paul Saab <ps@FreeBSD.org>	Fix for a bug in newreno partial ack handling where if a large amount of data is partial acked, snd_cwnd underflows, causing a burst. Found, Submitted by: Noritoshi Demizu Approved by: re
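The underflow described in this entry is what happens when an unsigned congestion window is reduced by more than its current value. A minimal sketch of a clamped reduction, using the hypothetical helper `cwnd_deflate` (the committed fix may differ in detail):

```c
#include <stdint.h>

/* Reduce snd_cwnd by the number of newly acknowledged bytes, clamping
 * at zero. A plain "snd_cwnd - acked" on unsigned values would wrap to
 * a huge window when acked > snd_cwnd, allowing a burst. */
static uint32_t
cwnd_deflate(uint32_t snd_cwnd, uint32_t acked)
{
	if (snd_cwnd > acked)
		return (snd_cwnd - acked);
	return (0);
}
```

With a window of 3000 bytes and a partial ACK covering 10000 bytes, the clamped version yields 0, where unchecked unsigned subtraction would yield a value close to 2^32.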
# 482ac968	01-Jul-2005	Paul Saab <ps@FreeBSD.org>	Fix for a bug in the change that defers sack option processing until after PAWS checks. The symptom of this is an inconsistency in the cached sack state, caused by the fact that the sack scoreboard was not being updated for an ACK handled in the header prediction path. Found by: Andrey Chernov. Submitted by: Noritoshi Demizu, Raja Mukerji. Approved by: re
# 69e03620	01-Jul-2005	Paul Saab <ps@FreeBSD.org>	Fix for a SACK crash caused by a bug in tcp_reass(). tcp_reass() does not clear tlen and frees the mbuf (leaving th pointing at freed memory), if the data segment is a complete duplicate. This change works around that bug. A fix for the tcp_reass() bug will appear later (that bug is benign for now, as neither th nor tlen is referenced in tcp_input() after the call to tcp_reass()). Found by: Pawel Jakub Dawidek. Submitted by: Raja Mukerji, Noritoshi Demizu. Approved by: re
# 0a389eab	29-Jun-2005	Simon L. B. Nielsen <simon@FreeBSD.org>	Fix ipfw packet matching errors with address tables. The ipfw tables lookup code caches the result of the last query. The kernel may process multiple packets concurrently, performing several concurrent table lookups. Due to insufficient locking, a cached result can become corrupted that could cause some addresses to be incorrectly matched against a lookup table. Submitted by: ru Reviewed by: csjp, mlaier Security: CAN-2005-2019 Security: FreeBSD-SA-05:13.ipfw Correct bzip2 permission race condition vulnerability. Obtained from: Steve Grubb via RedHat Security: CAN-2005-0953 Security: FreeBSD-SA-05:14.bzip2 Approved by: obrien Correct TCP connection stall denial of service vulnerability. A TCP packet with the SYN flag set is accepted for established connections, allowing an attacker to overwrite certain TCP options. Submitted by: Noritoshi Demizu Reviewed by: andre, Mohan Srinivasan Security: CAN-2005-2068 Security: FreeBSD-SA-05:15.tcp Approved by: re (security blanket), cperciva
# 5a53ca16	27-Jun-2005	Paul Saab <ps@FreeBSD.org>	- Postpone SACK option processing until after PAWS checks. SACK option processing is now done in the ACK processing case. - Merge tcp_sack_option() and tcp_del_sackholes() into a new function called tcp_sack_doack(). - Test (SEG.ACK < SND.MAX) before processing the ACK. Submitted by: Noritoshi Demizu Reviewed by: Mohan Srinivasan, Raja Mukerji Approved by: re
# 68d37625	25-Jun-2005	Stephan Uphoff <ups@FreeBSD.org>	Fix a timer ticks wrap around bug for minmssoverload processing. Approved by: re (scottl,dwhite) MFC after: 4 weeks
# 1e2d989d	31-May-2005	Robert Watson <rwatson@FreeBSD.org>	Assert that tcbinfo is locked in tcp_input() before calling into tcp_drop(). MFC after: 7 days
# 416738a7	01-Jun-2005	Robert Watson <rwatson@FreeBSD.org>	Assert the tcbinfo lock whenever tcp_close() is to be called by tcp_input(). MFC after: 7 days
# 808f11b7	25-May-2005	Paul Saab <ps@FreeBSD.org>	This conforms with the terminology in M.Mathis and J.Mahdavi, "Forward Acknowledgement: Refining TCP Congestion Control" SIGCOMM'96, August 1996. Submitted by: Noritoshi Demizu, Raja Mukerji
# 0077b016	11-May-2005	Paul Saab <ps@FreeBSD.org>	When looking for the next hole to retransmit from the scoreboard, or to compute the total retransmitted bytes in this sack recovery episode, the scoreboard is traversed. While in sack recovery, this traversal occurs on every call to tcp_output(), every dupack and every partial ack. The scoreboard could potentially get quite large, making this traversal expensive. This change optimizes this by storing hints (for the next hole to retransmit and the total retransmitted bytes in this sack recovery episode) reducing the complexity to find these values from O(n) to constant time. The debug code that sanity checks the hints against the computed value will be removed eventually. Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
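The hint technique described in this entry can be illustrated with a toy scoreboard. This is a sketch under assumptions, not the kernel's data structures: `struct hole`, `struct scoreboard`, `rexmit_bytes_slow`, and `note_rexmit` are all hypothetical names. The idea is to update two cached values whenever a retransmission is recorded, so that readers never have to walk the list.

```c
#include <stddef.h>
#include <stdint.h>

struct hole {
	uint32_t start, end;	/* missing range [start, end) */
	uint32_t rxmit;		/* next sequence number to retransmit */
	struct hole *next;
};

struct scoreboard {
	struct hole *head;
	struct hole *next_hint;		/* cached: next hole to retransmit */
	uint32_t rexmit_bytes_hint;	/* cached: total retransmitted bytes */
};

/* O(n) reference computation, the kind of traversal the hints replace
 * (and that debug code could use to sanity check them). */
static uint32_t
rexmit_bytes_slow(const struct scoreboard *sb)
{
	uint32_t total = 0;

	for (const struct hole *h = sb->head; h != NULL; h = h->next)
		total += h->rxmit - h->start;
	return (total);
}

/* Record a retransmission of up to len bytes from the hinted hole and
 * keep both hints current, so later reads are constant time. */
static void
note_rexmit(struct scoreboard *sb, uint32_t len)
{
	struct hole *h = sb->next_hint;

	if (h == NULL)
		return;
	if (len > h->end - h->rxmit)
		len = h->end - h->rxmit;
	h->rxmit += len;
	sb->rexmit_bytes_hint += len;
	/* Skip holes that have been fully retransmitted. */
	while (sb->next_hint != NULL &&
	    sb->next_hint->rxmit == sb->next_hint->end)
		sb->next_hint = sb->next_hint->next;
}
```

Reading `sb->next_hint` or `sb->rexmit_bytes_hint` costs the same regardless of how many holes the scoreboard holds, while `rexmit_bytes_slow()` remains available as a cross-check.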
# 25e6f9ed	14-Apr-2005	Paul Saab <ps@FreeBSD.org>	Fix for a TCP SACK bug where more than (win/2) bytes could have been in flight in SACK recovery. Found by: Noritoshi Demizu Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com> Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp> Raja Mukerji <raja at moselle dot com>
# cf09195b	09-Apr-2005	Paul Saab <ps@FreeBSD.org>	- Tighten up the Timestamp checks to prevent a spoofed segment from setting ts_recent to an arbitrary value, stopping further communication between the two hosts. - If the Echoed Timestamp is greater than the current time, fall back to the non RFC 1323 RTT calculation. Submitted by: Raja Mukerji (raja at moselle dot com) Reviewed by: Noritoshi Demizu, Mohan Srinivasan
# e346eeff	09-Apr-2005	Paul Saab <ps@FreeBSD.org>	- If the reassembly queue limit was reached or if we couldn't allocate a reassembly queue state structure, don't update (receiver) sack report. - Similarly, if tcp_drain() is called, freeing up all items on the reassembly queue, clean the sack report. Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp> Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com), Raja Mukerji (raja at moselle dot com).
# 7643c37c	17-Feb-2005	Paul Saab <ps@FreeBSD.org>	Remove 2 (SACK) fields from the tcpcb. These are only used by a function that is called from tcp_input(), so they oughta be passed on the stack instead of stuck in the tcpcb. Submitted by: Mohan Srinivasan
# 7776346f	15-Feb-2005	Paul Saab <ps@FreeBSD.org>	Fix for a SACK (receiver) bug where incorrect SACK blocks are reported to the sender - in the case where the sender sends data outside the window (as WinXP does :(). Reported by: Sam Jensen <sam at wand dot net dot nz> Submitted by: Mohan Srinivasan
# 8db456bf	14-Feb-2005	Paul Saab <ps@FreeBSD.org>	- Retransmit just one segment on initiation of SACK recovery. Remove the SACK "initburst" sysctl. - Fix bugs in SACK dupack and partialack handling that can cause large bursts while in SACK recovery. Submitted by: Mohan Srinivasan
# c398230b	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for license, minor formatting changes
# a69968ee	03-Jan-2005	Mike Silbersack <silby@FreeBSD.org>	Add a sysctl (net.inet.tcp.insecure_rst) which allows one to specify that the RFC 793 specification for accepting RST packets should be followed. When followed, this makes one vulnerable to the attacks described in "slipping in the window", but it may be necessary in some odd circumstances.
# 42cf3289	25-Dec-2004	Robert Watson <rwatson@FreeBSD.org>	In the dropafterack case of tcp_input(), it's OK to release the TCP pcbinfo lock before calling tcp_output(), as holding just the inpcb lock is sufficient to prevent garbage collection.
# e0bef1cb	25-Dec-2004	Robert Watson <rwatson@FreeBSD.org>	Revert parts of tcp_input.c:1.255 associated with the header predicted cases for tcp_input(): While it is true that the pcbinfo lock provides a pseudo-reference to inpcbs, both the inpcb and pcbinfo locks are required to free an un-referenced inpcb. As such, we can release the pcbinfo lock as long as the inpcb remains locked with the confidence that it will not be garbage-collected. This leads to a less conservative locking strategy that should reduce contention on the TCP pcbinfo lock. Discussed with: sam
# 2be3bf22	28-Nov-2004	Robert Watson <rwatson@FreeBSD.org>	Assert the inpcb lock in tcp_xmit_timer() as it performs read-modify- write of various time/rtt-related fields in the tcpcb.
# 18ad5842	28-Nov-2004	Robert Watson <rwatson@FreeBSD.org>	Expand coverage of the receive socket buffer lock when handling urgent pointer updates: test available space while holding the socket buffer mutex, and continue to hold until the pointer update has been performed. MFC after: 2 weeks
# 6a220ed8	25-Nov-2004	Mike Silbersack <silby@FreeBSD.org>	Fix a problem where our TCP stack would ignore RST packets if the receive window was 0 bytes in size. This may have been the cause of unsolved "connection not closing" reports over the years. Thanks to Michiel Boland for providing the fix and providing a concise test program for the problem. Submitted by: Michiel Boland MFC after: 2 weeks
# de30ea13	23-Nov-2004	Robert Watson <rwatson@FreeBSD.org>	In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the contents of the tcpcb are read and modified in volume. In tcp_input(), replace th comparison with 0 with a comparison with NULL. At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in tcp_input(), assert 'headlocked'. Try to improve consistency between various assertions regarding headlocked to be more informative. MFC after: 2 weeks
# cce83ffb	23-Nov-2004	Robert Watson <rwatson@FreeBSD.org>	tcp_timewait() performs multiple non-atomic reads on the tcptw structure, so assert the inpcb lock associated with the tcptw. Also assert the tcbinfo lock, as tcp_timewait() may call tcp_twclose() or tcp_timer_2msl_reset(), which require it. Since tcp_timewait() is already called with that lock from tcp_input(), this doesn't change current locking, merely documents reasons for it. In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_reset() is called, which requires that lock. In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop() is called, which requires that lock. Document the locking strategy for the time wait queues in tcp_timer.c, which consists of protecting the time wait queues in the same manner as the tcbinfo structure (using the tcbinfo lock). In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait queues are modified. In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait queues may be modified. In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait queues may be modified. MFC after: 2 weeks
# ca127a3e	22-Nov-2004	Robert Watson <rwatson@FreeBSD.org>	Remove "Unlocked read" annotations associated with previously unlocked use of socket buffer fields in the TCP input code. These references are now protected by use of the receive socket buffer lock. MFC after: 1 week
# d6915262	07-Nov-2004	Robert Watson <rwatson@FreeBSD.org>	Do some re-sorting of TCP pcbinfo locking and assertions: make sure to retain the pcbinfo lock until we're done using a pcb in the in-bound path, as the pcbinfo lock acts as a pseudo-reference to prevent the pcb from potentially being recycled. Clean up assertions and make sure to assert that the pcbinfo is locked at the head of code subsections where it is needed. Free the mbuf at the end of tcp_input after releasing any held locks to reduce the time the locks are held. MFC after: 3 weeks
# c94c54e4	02-Nov-2004	Andre Oppermann <andre@FreeBSD.org>	Remove RFC1644 T/TCP support from the TCP side of the network stack. A complete rationale and discussion is given in this message and the resulting discussion: http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706 Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplementation of the previous T/TCP functionality. Discussed on: -arch
# a55db2b6	05-Oct-2004	Paul Saab <ps@FreeBSD.org>	- Estimate the amount of data in flight in sack recovery and use it to control the packets injected while in sack recovery (for both retransmissions and new data). - Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c. - Add a new sysctl (net.inet.tcp.sack.initburst) that controls the number of sack retransmissions done upon initiation of sack recovery. Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>
# 9b932e9e	17-Aug-2004	Andre Oppermann <andre@FreeBSD.org>	Convert ipfw to use PFIL_HOOKS. This change is transparent to userland and preserves the ipfw ABI. The ipfw core packet inspection and filtering functions have not been changed, only how ipfw is invoked is different. However there are many changes in how ipfw and its add-ons are handled: In general ipfw is now called through the PFIL_HOOKS and most associated magic, that was in ip_input() or ip_output() previously, is now done in ipfw_check_[in|out]() in the ipfw PFIL handler. IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to be diverted is checked if it is fragmented, if yes, ip_reass() gets in for reassembly. If not, or all fragments arrived and the packet is complete, divert_packet is called directly. For 'tee' no reassembly attempt is made and a copy of the packet is sent to the divert socket unmodified. The original packet continues its way through ip_input/output(). ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet with the new destination sockaddr_in. A check if the new destination is a local IP address is made and the m_flags are set appropriately. ip_input() and ip_output() have some more work to do here. For ip_input() the m_flags are checked and a packet for us is directly sent to the 'ours' section for further processing. Destination changes on the input path are only tagged and the 'srcrt' flag to ip_forward() is set to disable destination checks and ICMP replies at this stage. The tag is going to be handled on output. ip_output() again checks for m_flags and the 'ours' tag. If found, the packet will be dropped back to the IP netisr where it is going to be picked up by ip_input() again and then directly sent to the 'ours' section. When only the destination changes, the route's 'dst' is overwritten with the new destination from the forward m_tag. 
Then it jumps back at the route lookup again and skips the firewall check because it has been marked with M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with 'option IPFIREWALL_FORWARD' to enable it. DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will then inject it back into ip_input/ip_output() after it has served its time. Dummynet packets are tagged and will continue from the next rule when they hit the ipfw PFIL handlers again after re-injection. BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS. More detailed changes to the code: conf/files Add netinet/ip_fw_pfil.c. conf/options Add IPFIREWALL_FORWARD option. modules/ipfw/Makefile Add ip_fw_pfil.c. net/bridge.c Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw is still directly invoked to handle layer2 headers and packets would get a double ipfw when run through PFIL_HOOKS as well. netinet/ip_divert.c Removed divert_clone() function. It is no longer used. netinet/ip_dummynet.[ch] Neither the route 'ro' nor the destination 'dst' need to be stored while in dummynet transit. Structure members and associated macros are removed. netinet/ip_fastfwd.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_fw.h Removed 'ro' and 'dst' from struct ip_fw_args. netinet/ip_fw2.c (Re)moved some global variables and the module handling. netinet/ip_fw_pfil.c New file containing the ipfw PFIL handlers and module initialization. netinet/ip_input.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. ip_forward() no longer requires the 'next_hop' struct sockaddr_in argument. Disable early checks if 'srcrt' is set. 
netinet/ip_output.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_var.h Add ip_reass() as general function. (Used from ipfw PFIL handlers for IPDIVERT.) netinet/raw_ip.c Directly check if ipfw and dummynet control pointers are active. netinet/tcp_input.c Rework the 'ipfw forward' to local code to work with the new way of forward tags. netinet/tcp_sack.c Remove include 'opt_ipfw.h' which is not needed here. sys/mbuf.h Remove m_claim_next() macro which was exclusively for ipfw 'forward' and is no longer needed. Approved by: re (scottl)
# a4f757cd	16-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	White space cleanup for netinet before branch: - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>
# 7cfc6904	12-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	After each label in tcp_input(), assert the inpcbinfo and inpcb lock state that we expect.
# a0445c2e	01-Jul-2004	Jayanth Vijayaraghavan <jayanth@FreeBSD.org>	On receiving 3 duplicate acknowledgements, SACK recovery was not being entered correctly. Fix this problem by separating out the SACK and the newreno cases. Also, check if we are in FASTRECOVERY for the sack case and if so, turn off dupacks. Fix an issue where the congestion window was not being incremented by ssthresh. Thanks to Mohan Srinivasan for finding this problem.
# 1e4d7da7	26-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Reduce the number of unnecessary unlock-relocks on socket buffer mutexes associated with performing a wakeup on the socket buffer: - When performing an sbappend*() followed by a so[rw]wakeup(), explicitly acquire the socket buffer lock and use the _locked() variants of both calls. Note that the _locked() sowakeup() versions unlock the mutex on return. This is done in uipc_send(), divert_packet(), mroute socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append(). - When the socket buffer lock is dropped before a sowakeup(), remove the explicit unlock and use the _locked() sowakeup() variant. This is done in soisdisconnecting(), soisdisconnected() when setting the can't send/ receive flags and dropping data, and in uipc_rcvd() when adjusting back-pressure on the sockets. For UNIX domain sockets running mpsafe with a contention-intensive SMP mysql benchmark, this results in a 1.6% query rate improvement due to reduced mutex costs.
# 652178a1	24-Jun-2004	Paul Saab <ps@FreeBSD.org>	White space & spelling fixes Submitted by: Xin LI <delphij@frontfree.net>
# 5905999b	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Broaden scope of the socket buffer lock when processing an ACK so that the read and write of sb_cc are atomic. Call sbdrop_locked() instead of sbdrop() since we already hold the socket buffer lock.
# 927c5cea	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Protect so_oobmark with SOCKBUF_LOCK(&so->so_rcv), and broaden locking in tcp_input() for TCP packets with urgent data pointers to hold the socket buffer lock across testing and updating oobmark from just protecting sb_state. Update socket locking annotations
# 3f11a2f3	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Introduce sbreserve_locked(), which asserts the socket buffer lock on the socket buffer having its limits adjusted. sbreserve() now acquires the lock before calling sbreserve_locked(). In soreserve(), acquire socket buffer locks across read-modify-writes of socket buffer fields, and calls into sbreserve/sbrelease; make sure to acquire in keeping with the socket buffer lock order. In tcp_mss(), acquire the socket buffer lock in the calling context so that we have atomic read-modify-write on buffer sizes.
# 6d90faf3	23-Jun-2004	Paul Saab <ps@FreeBSD.org>	Add support for TCP Selective Acknowledgements. The work for this originated on RELENG_4 and was ported to -CURRENT. The scoreboarding code was obtained from OpenBSD, and many of the remaining changes were inspired by OpenBSD, but not taken directly from there. You can enable/disable sack using net.inet.tcp.do_sack. You can also limit the number of sack holes that all senders can have in the scoreboard with net.inet.tcp.sackhole_limit. Reviewed by: gnn Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
# 1f82efb3	20-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Assert the inpcb lock before letting MAC check whether we can deliver to the inpcb in tcp_input().
# d420fcda	16-Jun-2004	Bruce M Simpson <bms@FreeBSD.org>	Fix build for IPSEC && !INET6 PR: kern/66125 Submitted by: Cyrille Lefevre
# 7721f5d7	14-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Grab the socket buffer send or receive mutex when performing a read-modify-write on the sb_state field. This commit catches only the "easy" ones where it doesn't interact with as yet unmerged locking.
# c0b99ffa	14-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coalesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.
# 310e7ceb	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Socket MAC labels so_label and so_peerlabel are now protected by SOCK_LOCK(so): - Hold socket lock over calls to MAC entry points reading or manipulating socket labels. - Assert socket lock in MAC entry point implementations. - When externalizing the socket label, first make a thread-local copy while holding the socket lock, then release the socket lock to externalize to userspace.
# 2f3f1e67	02-May-2004	Darren Reed <darrenr@FreeBSD.org>	Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier.
# 7fbb1300	02-May-2004	Darren Reed <darrenr@FreeBSD.org>	oops, I forgot this file in a prior commit (change was still sitting here, uncommitted): Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg (the type of tag to claim) and push it out of ip_var.h into mbuf.h alongside all of the other macros that work on mbuf's and tag's.
# 80dd2a81	25-Apr-2004	Mike Silbersack <silby@FreeBSD.org>	Tighten up reset handling in order to make reset attacks as difficult as possible while maintaining compatibility with the widest range of TCP stacks. The algorithm is as follows: --- For connections in the ESTABLISHED state, only resets with sequence numbers exactly matching last_ack_sent will cause a reset, all other segments will be silently dropped. For connections in all other states, a reset anywhere in the window will cause the connection to be reset. All other segments will be silently dropped. --- The necessity of accepting all in-window resets was discovered by jayanth and jlemon, both of whom have seen TCP stacks that will respond to FIN-ACK packets with resets not meeting the strict last_ack_sent check. Idea by: Darren Reed Reviewed by: truckman, jlemon, others(?)
# 2d166c02	23-Apr-2004	Andre Oppermann <andre@FreeBSD.org>	Correct an edge case in tcp_mss() where the cached path MTU from tcp_hostcache would have overridden a (now) lower MTU of an interface or route that changed since first PMTU discovery. The bug would have caused TCP to redo the PMTU discovery when not strictly necessary. Make a comment about already pre-initialized default values more clear. Reviewed by: sam
# f36cfd49	07-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
# 04d3a452	01-Mar-2004	Hajimu UMEMOTO <ume@FreeBSD.org>	fix -O0 compilation without INET6. Pointed out by: ru
# a7b6a14a	28-Feb-2004	Robert Watson <rwatson@FreeBSD.org>	Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These were needed by the MAC Framework until inpcbs gained labels. Submitted by: sam
# ac9d7e26	25-Feb-2004	Max Laier <mlaier@FreeBSD.org>	Re-remove MT_TAGs. The problems with dummynet have been fixed now. Tested by: -current, bms(mentor), me Approved by: bms(mentor), sam
# 89c02376	25-Feb-2004	Jeffrey Hsu <hsu@FreeBSD.org>	Relax a KASSERT condition to allow for a valid corner case where the FIN on the last segment consumes an extra sequence number. Spurious panic reported by Mike Silbersack <silby@silby.com>.
# 12e2e970	24-Feb-2004	Andre Oppermann <andre@FreeBSD.org>	Convert the tcp segment reassembly queue to UMA and limit the maximum amount of segments it will hold. The following tuneables and sysctls control the behaviour of the tcp segment reassembly queue: net.inet.tcp.reass.maxsegments (loader tuneable) specifies the maximum number of segments all tcp reassembly queues can hold (defaults to 1/16 of nmbclusters). net.inet.tcp.reass.maxqlen specifies the maximum number of segments any individual tcp session queue can hold (defaults to 48). net.inet.tcp.reass.cursegments (readonly) counts the number of segments currently in all reassembly queues. net.inet.tcp.reass.overflows (readonly) counts how often either the global or local queue limit has been reached. Tested by: bms, silby Reviewed by: bms, silby
# 36e8826f	17-Feb-2004	Max Laier <mlaier@FreeBSD.org>	Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is not working properly with the patch in place. Approved by: bms(mentor)
# da0f4099	17-Feb-2004	Hajimu UMEMOTO <ume@FreeBSD.org>	IPSEC and FAST_IPSEC have the same internal API now; so merge these (IPSEC has an extra ipsecstat) Submitted by: "Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>
# 1094bdca	13-Feb-2004	Max Laier <mlaier@FreeBSD.org>	This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing them mostly with packet tags (one case is handled by using an mbuf flag since the linkage between "caller" and "callee" is direct and there's no need to incur the overhead of a packet tag). This is (mostly) work from: sam Silence from: -arch Approved by: bms(mentor), sam, rwatson
# 265ed012	13-Feb-2004	Bruce M Simpson <bms@FreeBSD.org>	Brucification. Submitted by: bde
# a0194ef1	12-Feb-2004	Bruce M Simpson <bms@FreeBSD.org>	Remove an unnecessary initialization that crept in from the code which verifies TCP-MD5 digests. Noticed by: njl
# 1cfd4b53	10-Feb-2004	Bruce M Simpson <bms@FreeBSD.org>	Initial import of RFC 2385 (TCP-MD5) digest support. This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net
# f073c60f	03-Feb-2004	Hajimu UMEMOTO <ume@FreeBSD.org>	pass pcb rather than so. it is expected that per socket policy works again.
# 61a36e3d	20-Jan-2004	Jeffrey Hsu <hsu@FreeBSD.org>	Merge from DragonFlyBSD rev 1.10: date: 2003/09/02 10:04:47; author: hsu; state: Exp; lines: +5 -6 Account for when Limited Transmit is not congestion window limited. Obtained from: DragonFlyBSD
# 53369ac9	08-Jan-2004	Andre Oppermann <andre@FreeBSD.org>	Limiters and sanity checks for TCP MSS (maximum segment size) resource exhaustion attacks. For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU. The resource exhaustion works in two ways: o during tcp connection setup the advertized local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one. For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limits of network components along the path. This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP). We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets. o the local host is receiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may choose to do what is described in the first attack and send the data in packets with a TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process. For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches). This type of attack is particularly effective for servers where the attacker can upload large amounts of data. 
Normally this is the case with WWW servers where large POSTs can be made. We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is reset and dropped. MITRE CVE: CAN-2004-0002 Reviewed by: sam (mentor) MFC after: 1 day
# dba7bc6a	06-Jan-2004	Andre Oppermann <andre@FreeBSD.org>	Enable the following TCP options by default to give it more exposure: rfc3042 Limited retransmit rfc3390 Increasing TCP's initial congestion Window inflight TCP inflight bandwidth limiting All my production servers have it enabled and there have been no issues. I am confident about having them on by default and it gives us better overall TCP performance. Reviewed by: sam (mentor)
# 943ae302	25-Nov-2003	Andre Oppermann <andre@FreeBSD.org>	Restructure a too broad ifdef which was disabling the setting of the tcp flightsize sysctl value for local networks in the !INET6 case. Approved by: re (scottl)
# 97d8d152	20-Nov-2003	Andre Oppermann <andre@FreeBSD.org>	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
# a557af22	17-Nov-2003	Robert Watson <rwatson@FreeBSD.org>	Introduce a MAC label reference in 'struct inpcb', which caches the MAC label referenced from 'struct socket' in the IPv4 and IPv6-based protocols. This permits MAC labels to be checked during network delivery operations without dereferencing inp->inp_socket to get to so->so_label, which will eventually avoid our having to grab the socket lock during delivery at the network layer. This change introduces 'struct inpcb' as a labeled object to the MAC Framework, along with the normal circus of entry points: initialization, creation from socket, destruction, as well as a delivery access control check. For most policies, the inpcb label will simply be a cache of the socket label, so a new protocol switch method is introduced, pr_sosetlabel() to notify protocols that the socket layer label has been updated so that the cache can be updated while holding appropriate locks. Most protocols implement this using pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use the worker function in_pcbsosetlabel(), which calls into the MAC Framework to perform a cache update. Biba, LOMAC, and MLS implement these entry points, as do the stub policy, and test policy. Reviewed by: sam, bms Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 122aad88	12-Nov-2003	Andre Oppermann <andre@FreeBSD.org>	dropwithreset is not needed in this case as tcp_drop() is already notifying the other side. Before we were sending two RST packets.
# c29afad6	08-Nov-2003	Sam Leffler <sam@FreeBSD.org>	o correct locking problem: the inpcb must be held across tcp_respond o add assertions in tcp_respond to validate inpcb locking assumptions o use local variable instead of chasing pointers in tcp_respond Supported by: FreeBSD Foundation
# 395bb186	27-Oct-2003	Sam Leffler <sam@FreeBSD.org>	speedup stream socket recv handling by tracking the tail of the mbuf chain instead of walking the list for each append Submitted by: ps/jayanth Obtained from: netbsd (jason thorpe)
# b3399803	20-Oct-2003	Hajimu UMEMOTO <ume@FreeBSD.org>	enclose IPv6 part with ifdef INET6. Obtained from: KAME
# 31b3783c	20-Oct-2003	Hajimu UMEMOTO <ume@FreeBSD.org>	correct linkmtu handling. Obtained from: KAME
# 31b1bfe1	17-Oct-2003	Hajimu UMEMOTO <ume@FreeBSD.org>	- add dom_if{attach,detach} framework. - transition to use ifp->if_afdata. Obtained from: KAME
# 3c653157	13-Aug-2003	Hartmut Brandt <harti@FreeBSD.org>	A number of patches in the last years have created new return paths in tcp_input that leave the function before hitting the tcp_trace function call for the TCPDEBUG option. This has made TCPDEBUG mostly useless (and tools like ports/benchmarks/dbs not working). Add tcp_trace calls to the return paths that could be identified in this maze. This is a NOP unless you compile with TCPDEBUG.
# 9d11646d	15-Jul-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Unify the "send high" and "recover" variables as specified in the latest rev of the spec. Use an explicit flag for Fast Recovery. [1] Fix bug with exiting Fast Recovery on a retransmit timeout diagnosed by Lu Guohan. [2] Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com> Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2] Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>, Sally Floyd <floyd@acm.org> [1]
# e4d2978d	31-May-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Add /* FALLTHROUGH */ Found by: FlexeLint
# 430c6354	06-May-2003	Robert Watson <rwatson@FreeBSD.org>	Correct a bug introduced with reduced TCP state handling; make sure that the MAC label on TCP responses during TIMEWAIT is properly set from either the socket (if available), or the mbuf that it's responding to. Unfortunately, this is made somewhat difficult by the TCP code, as tcp_twstart() calls tcp_twrespond() after discarding the socket but without a reference to the mbuf that causes the "response". Passing both the socket and the mbuf works around this--eventually it might be good to make sure the mbuf always gets passed in in "response" scenarios but working through this proved to complicate things too much. Approved by: re (scottl) Reviewed by: hsu Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 152385d1	21-Apr-2003	David E. O'Brien <obrien@FreeBSD.org>	Explicitly declare 'int' parameters.
# 48d2549c	01-Apr-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Observe conservation of packets when entering Fast Recovery while doing Limited Transmit. Only artificially inflate the congestion window by 1 segment instead of the usual 3 to take into account the 2 already sent by Limited Transmit. Approved in principle by: Mark Allman <mallman@grc.nasa.gov>, Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>
# 7792ea27	13-Mar-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Greatly simplify the unlocking logic by holding the TCP protocol lock until after FIN_WAIT_2 processing. Helped with debugging: Doug Barton
# da3a8a1a	12-Mar-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Add support for RFC 3390, which allows for a variable-sized initial congestion window.
# 582a954b	12-Mar-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Implement the Limited Transmit algorithm (RFC 3042).
# 607b0b0c	08-Mar-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Remove a panic(); if the zone allocator can't provide more timewait structures, reuse the oldest one. Also move the expiry timer from a per-structure callout to the tcp slow timer. Sponsored by: DARPA, NAI Labs
# 272c5dfe	26-Feb-2003	Jonathan Lemon <jlemon@FreeBSD.org>	In timewait state, if the incoming segment is a pure in-sequence ack that matches snd_max, then do not respond with an ack, just drop the segment. This fixes a problem where a simultaneous close results in an ack loop between two time-wait states. Test case supplied by: Tim Robbins <tjr@FreeBSD.ORG> Sponsored by: DARPA, NAI Labs
# ef6b48de	26-Feb-2003	Jonathan Lemon <jlemon@FreeBSD.org>	The TCP protocol lock may still be held if the reassembly queue dropped FIN. Detect this case and drop the lock accordingly. Sponsored by: DARPA, NAI Labs
# 11a20fb8	23-Feb-2003	Jeffrey Hsu <hsu@FreeBSD.org>	tcp_twstart() needs to be called with the TCP protocol lock held to avoid a race condition with the TCP timer routines.
# 2fbef918	23-Feb-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Pass the right function to callout_reset() for a compressed TIME-WAIT control block.
# f243998b	23-Feb-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Yesterday just wasn't my day. Remove testing delta that crept into the diff. Pointy hat provided by: sam
# a14c749f	22-Feb-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Check to see if the TF_DELACK flag is set before returning from tcp_input(). This unbreaks delack handling, while still preserving correct T/TCP behavior Tested by: maxim Sponsored by: DARPA, NAI Labs
# 340c35de	19-Feb-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Add a TCP TIMEWAIT state which uses less space than a fullblown TCP control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs
# 41446225	19-Feb-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Correct comments.
# 3bfd6421	19-Feb-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Clean up delayed acks and T/TCP interactions: - delay acks for T/TCP regardless of delack setting - fix bug where a single pass through tcp_input might not delay acks - use callout_active() instead of callout_pending() Sponsored by: DARPA, NAI Labs
# 85e8b243	13-Feb-2003	Jeffrey Hsu <hsu@FreeBSD.org>	The protocol lock is always held in the dropafterack case, so we don't need to check for it at runtime.
# 39eb27a4	02-Feb-2003	Crist J. Clark <cjc@FreeBSD.org>	Add the TCP flags to the log message whenever log_in_vain is 1, not just when set to 2. PR: kern/43348 MFC after: 5 days
# cb942153	13-Jan-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Fix NewReno. Reviewed by: Tom Henderson <thomas.r.henderson@boeing.com>
# 07fd333d	30-Dec-2002	Matthew Dillon <dillon@FreeBSD.org>	Remove the PAWS ack-on-ack debugging printf(). Note that the original RFC 1323 (PAWS) says in 4.2.1 that the out of order / reverse-time-indexed packet should be acknowledged as specified in RFC-793 page 69 then dropped. The original PAWS code in FreeBSD (1994) simply acknowledged the segment unconditionally, which is incorrect, and was fixed in 1.183 (2002). At the moment we do not do checks for SYN or FIN in addition to (tlen != 0), which may or may not be correct, but the worst that ought to happen should be a retry by the sender.
# 540e8b7e	20-Dec-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Unravel a nested conditional. Remove an unneeded local variable.
# 967adce8	16-Dec-2002	Matthew Dillon <dillon@FreeBSD.org>	Fix syntax in last commit.
# 1ab4789d	14-Dec-2002	Matthew Dillon <dillon@FreeBSD.org>	Bruce forwarded this tidbit from an analysis Van Jacobson did on an apparent ack-on-ack problem with FreeBSD. Prof. Jacobson noticed a case in our TCP stack which would acknowledge a received ack-only packet, which is not legal in TCP. Submitted by: Van Jacobson <van@packetdesign.com>, bmah@packetdesign.com (Bruce A. Mah) MFC after: 7 days
# 6f0d017c	10-Nov-2002	Sam Leffler <sam@FreeBSD.org>	a better solution to building FAST_IPSEC w/o INET6 Submitted by: Jeffrey Hsu <hsu@FreeBSD.org>
# 58fcadfc	08-Nov-2002	Sam Leffler <sam@FreeBSD.org>	fixup FAST_IPSEC build w/o INET6
# 1645d090	31-Oct-2002	Jeff Roberson <jeff@FreeBSD.org>	- Consistently update snd_wl1, snd_wl2, and rcv_up in the header prediction code. Previously, 2GB worth of header predicted data could leave these variables too far out of sequence which would cause problems after receiving a packet that did not match the header prediction. Submitted by: Bill Baumann <bbaumann@isilon.com> Sponsored by: Isilon Systems, Inc. Reviewed by: hsu, pete@isilon.com, neal@isilon.com, aaronp@isilon.com
# 30613f56	30-Oct-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Don't need to check if SO_OOBINLINE is defined. Don't need to protect isipv6 conditional with INET6. Fix leading indentation in 2 lines.
# b9234faf	15-Oct-2002	Sam Leffler <sam@FreeBSD.org>	Tie new "Fast IPsec" code into the build. This involves the usual configuration stuff as well as conditional code in the IPv4 and IPv6 areas. Everything is conditional on FAST_IPSEC which is mutually exclusive with IPSEC (KAME IPsec implementation). As noted previously, don't use FAST_IPSEC with INET6 at the moment. Reviewed by: KAME, rwatson Approved by: silence Supported by: Vernier Networks
# 5d846453	15-Oct-2002	Sam Leffler <sam@FreeBSD.org>	Replace aux mbufs with packet tags: o instead of a list of mbufs use a list of m_tag structures a la openbsd o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit ABI/module number cookie o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and use this in defining openbsd-compatible m_tag_find and m_tag_get routines o rewrite KAME use of aux mbufs in terms of packet tags o eliminate the most heavily used aux mbufs by adding an additional struct inpcb parameter to ip_output and ip6_output to allow the IPsec code to locate the security policy to apply to outbound packets o bump __FreeBSD_version so code can be conditionalized o fixup ipfilter's call to ip_output based on __FreeBSD_version Reviewed by: julian, luigi (silent), -arch, -net, darren Approved by: julian, silence from everyone else Obtained from: openbsd (mostly) MFC after: 1 month
# a84db8f4	30-Sep-2002	Matthew Dillon <dillon@FreeBSD.org>	Guido found another bug. There is a situation with timestamped TCP packets where FreeBSD will send DATA+FIN and a W2K box will ack just the DATA portion. If this occurs after FreeBSD has done a (NewReno) fast-retransmit and is recovering it (dupacks > threshold) it triggers a case in tcp_newreno_partial_ack() (tcp_newreno() in stable) where tcp_output() is called with the expectation that the retransmit timer will be reloaded. But tcp_output() falls through and returns without doing anything, causing the persist timer to be loaded instead. This causes the connection to hang until W2K gives up. This occurs because in the case where only the FIN must be acked, the 'len' calculation in tcp_output() will be 0, a lot of checks will be skipped, and the FIN check will also be skipped because it is designed to handle FIN retransmits, not forced transmits from tcp_newreno(). The solution is to simply set TF_ACKNOW before calling tcp_output() to absolutely guarantee that it will run the send code and reset the retransmit timer. TF_ACKNOW is already used for this purpose in other cases. For some unknown reason this patch also seems to greatly reduce the number of duplicate acks received when Guido runs his tests over a lossy network. It is quite possible that there are other tcp_newreno{_partial_ack()} cases which were not generating the expected output which this patch also fixes. X-MFC after: Will be MFC'd after the freeze is over
# c1c36a2c	21-Sep-2002	Mike Silbersack <silby@FreeBSD.org>	Fix issue where shutdown(socket, SHUT_RD) was effectively ignored for TCP sockets. NetBSD PR: 18185 Submitted by: Sean Boudreau <seanb@qnx.com> MFC after: 3 days
# fa55172b	17-Sep-2002	Matthew Dillon <dillon@FreeBSD.org>	Guido reported an interesting bug where an FTP connection between a Windows 2000 box and a FreeBSD box could stall. The problem turned out to be a timestamp reply bug in the W2K TCP stack. FreeBSD sends a timestamp with the SYN, W2K returns a timestamp of 0 in the SYN+ACK causing FreeBSD to calculate an insane SRTT and RTT, resulting in a maximal retransmit timeout (60 seconds). If there is any packet loss on the connection for the first six or so packets the retransmit case may be hit (the window will still be too small for fast-retransmit), causing a 60+ second pause. The W2K box gives up and closes the connection. This commit works around the W2K bug. 15:04:59.374588 FREEBSD.20 > W2K.1036: S 1420807004:1420807004(0) win 65535 <mss 1460,nop,wscale 2,nop,nop,timestamp 188297344 0> (DF) [tos 0x8] 15:04:59.377558 W2K.1036 > FREEBSD.20: S 4134611565:4134611565(0) ack 1420807005 win 17520 <mss 1460,nop,wscale 0,nop,nop,timestamp 0 0> (DF) Bug reported by: Guido van Rooij <guido@gvr.org>
# 93b0017f	25-Aug-2002	Philippe Charnier <charnier@FreeBSD.org>	Replace various spelling with FALLTHROUGH which is lint()able
# ded7008a	19-Aug-2002	Juli Mallett <jmallett@FreeBSD.org>	Enclose IPv6 addresses in brackets when they are displayed in printable form with a TCP/UDP port separated by a colon. This is for the log_in_vain facility. Pointed out by: Edward J. M. Brocklesby Reviewed by: ume MFC after: 2 weeks
# 1fcc99b5	17-Aug-2002	Matthew Dillon <dillon@FreeBSD.org>	Implement TCP bandwidth delay product window limiting, similar to (but not meant to duplicate) TCP/Vegas. Add four sysctls and default the implementation to 'off'. net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off) net.inet.tcp.inflight_debug debugging (defaults to 1=on) net.inet.tcp.inflight_min minimum window limit net.inet.tcp.inflight_max maximum window limit MFC after: 1 week
# c068736a	16-Aug-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Cosmetic-only changes for readability. Reviewed by: (early form passed by) bde Approved by: itojun (from core@kame.net)
# fb95b5d3	15-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Rename mac_check_socket_receive() to mac_check_socket_deliver() so that we can use the names _receive() and _send() for the receive() and send() checks. Rename related constants, policy implementations, etc. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# b5addd85	15-Aug-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Reset dupack count in header prediction. Follow-on to rev 1.39. Reviewed by: jayanth, Thomas R Henderson <thomas.r.henderson@boeing.com>, silby, dillon
# c488362e	31-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce support for Mandatory Access Control and extensible kernel access control. Instrument the TCP socket code for packet generation and delivery: label outgoing mbufs with the label of the socket, and check socket and mbuf labels before permitting delivery to a socket. Assign labels to newly accepted connections when the syncache/cookie code has done its business. Also set peer labels as convenient. Currently, MAC policies cannot influence the PCB matching algorithm, so cannot implement polyinstantiation. Note that there is at least one case where a PCB is not available due to the TCP packet not being associated with any socket, so we don't label in that case, but need to handle it in a special manner. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 88c39af3	22-Jul-2002	Ruslan Ermilov <ru@FreeBSD.org>	Don't shrink socket buffers in tcp_mss(), application might have already configured them with setsockopt(SO_*BUF), for RFC1323's scaled windows. PR: kern/11966 MFC after: 1 week
# d65bf08a	19-Jul-2002	Matthew Dillon <dillon@FreeBSD.org>	Add the tcps_sndrexmitbad statistic, keep track of late acks that caused unnecessary retransmissions.
# 6fd22caf	24-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Avoid unlocking the inp twice if badport_bandlim() returns -1. Reported by: jlemon
# f14e4cfe	24-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Style bug: fix 4 space indentations that should have been tabs. Submitted by: jlemon
# 410bb1bf	23-Jun-2002	Luigi Rizzo <luigi@FreeBSD.org>	Move two global variables to automatic variables within the only function where they are used (they are used with TCPDEBUG only).
# 2b25acc1	22-Jun-2002	Luigi Rizzo <luigi@FreeBSD.org>	Remove (almost all) global variables that were used to hold packet forwarding state ("annotations") during ip processing. The code is considerably cleaner now. The variables removed by this change are: ip_divert_cookie used by divert sockets ip_fw_fwd_addr used for transparent ip redirection last_pkt used by dynamic pipes in dummynet Removal of the first two has been done by carrying the annotations into volatile structs prepended to the mbuf chains, and adding appropriate code to add/remove annotations in the routines which make use of them, i.e. ip_input(), ip_output(), tcp_input(), bdg_forward(), ether_demux(), ether_output_frame(), div_output(). In passing, remove a bug in divert handling of fragmented packets. Now it is the fragment at offset 0 which sets the divert status of the whole packet, whereas formerly it was the last incoming fragment to decide. Removal of last_pkt required a change in the interface of ip_fw_chk() and dummynet_io(). In passing, use the same mechanism for dummynet annotations and for divert/forward annotations. option IPFIREWALL_FORWARD is effectively useless, the code to implement it is very small and is now in by default to avoid the obfuscation of conditionally compiled code. NOTES: * there is at least one global variable left, sro_fwd, in ip_output(). I am not sure if/how this can be removed. * I have deliberately avoided gratuitous style changes in this commit to avoid cluttering the diffs. Minor style cleanup will likely be necessary. * this commit only focused on the IP layer. I am sure there are a number of global variables used in the TCP and maybe UDP stack. * despite the number of files touched, there are absolutely no API's or data structures changed by this commit (except the interfaces of ip_fw_chk() and dummynet_io(), which are internal anyways), so an MFC is quite safe and unintrusive (and desirable, given the improved readability of the code). MFC after: 10 days
# 03e49181	18-Jun-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Remove so*_locked(), which were backed out by mistake.
# f76fcf6d	10-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Lock up inpcb. Submitted by: Jennifer Yang <yangjihui@yahoo.com>
# 4cc20ab1	31-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Back out my last commit of locking down a socket, as it conflicts with hsu's work. Requested by: hsu
# 243917fe	19-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# f1320723	01-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Redo the sigio locking. Turn the sigio sx into a mutex. Sigio lock is really only needed to protect interrupts from dereferencing the sigio pointer in an object when the sigio itself is being destroyed. In order to do this in the most unintrusive manner change pgsigio's sigio * argument into a **, that way we can lock internally to the function.
# 960ed29c	29-Apr-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Revert the change of #includes in sys/filedesc.h and sys/socketvar.h. Requested by: bde Since locking sigio_lock is usually followed by calling pgsigio(), move the declaration of sigio_lock and the definitions of SIGIO_*() to sys/signalvar.h. While I am here, sort include files alphabetically, where possible.
# d48d4b25	27-Apr-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Add a global sx sigio_lock to protect the pointer to the sigio object of a socket. This avoids lock order reversal caused by locking a process in pgsigio(). sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now require sigio_lock to be locked. Provide sowwakeup_locked(), soisconnected_locked(), and so on in case where we have to modify a socket and wake up a process atomically.
# 88ff5695	18-Apr-2002	SUZUKI Shinsuke <suz@FreeBSD.org>	just merged cosmetic changes from KAME to ease sync between KAME and FreeBSD. (based on freebsd4-snap-20020128) Reviewed by: ume MFC after: 1 week
# 898568d8	10-Apr-2002	Mike Silbersack <silby@FreeBSD.org>	Remove some ISN generation code which has been unused since the syncache went in. MFC after: 3 days
# c1cd65ba	24-Mar-2002	Bruce Evans <bde@FreeBSD.org>	Fixed some style bugs in the removal of __P(()). Continuation lines were not outdented to preserve non-KNF lining up of code with parentheses. Switch to KNF formatting.
# 4d77a549	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# 93ec91ba	27-Feb-2002	Crist J. Clark <cjc@FreeBSD.org>	Change the wording of the inline comments from the previous commit. Objection from: ru
# 2ca2159f	25-Feb-2002	Crist J. Clark <cjc@FreeBSD.org>	The TCP code did not do sufficient checks on whether incoming packets were destined for a broadcast IP address. All TCP packets with a broadcast destination must be ignored. The system only ignored packets that were _link-layer_ broadcasts or multicast. We need to check the IP address too since it is quite possible for a broadcast IP address to come in with a unicast link-layer address. Note that the check existed prior to CSRG revision 7.35, but was removed. This commit effectively backs out that nine-year-old change. PR: misc/35022
# fd8e4ebc	18-Feb-2002	Mike Barcroft <mike@FreeBSD.org>	o Move NTOHL() and associated macros into <sys/param.h>. These are deprecated in favor of the POSIX-defined lowercase variants. o Change all occurrences of NTOHL() and associated macros in the source tree to use the lowercase function variants. o Add missing license bits to sparc64's <machine/endian.h>. Approved by: jake o Clean up <machine/endian.h> files. o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>. o Remove prototypes for non-existent bswapXX() functions. o Include <machine/endian.h> in <arpa/inet.h> to define the POSIX-required ntohl() family of functions. o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>, and <sys/param.h>. o Prepend underscores to the ntohl() family to help deal with complexities associated with having MD (asm and inline) versions, and having to prevent exposure of these functions in other headers that happen to make use of endian-specific defines. o Create weak aliases to the canonical function name to help deal with third-party software forgetting to include an appropriate header. o Remove some now unneeded pollution from <sys/types.h>. o Add missing <arpa/inet.h> includes in userland. Tested on: alpha, i386 Reviewed by: bde, jake, tmm
# e6658b12	04-Jan-2002	Robert Watson <rwatson@FreeBSD.org>	o Spelling fix in comment: tcp_ouput -> tcp_output
# 0ef3206b	12-Dec-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Fix up tabs in comments.
# 262c1c1a	02-Dec-2001	Matthew Dillon <dillon@FreeBSD.org>	Fix a bug with transmitter restart after receiving a 0 window. The receiver was not sending an immediate ack with delayed acks turned on when the input buffer is drained, preventing the transmitter from restarting immediately. Propagate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and is a good idea anyway). Some cleanup. Identify additional issues in comments. MFC after: 1 day
# be2ac88c	21-Nov-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Introduce a syncache, which enables FreeBSD to withstand a SYN flood DoS in an improved fashion over the existing code. Reviewed by: silby (in a previous iteration) Sponsored by: DARPA, NAI Labs
# d00fd201	21-Nov-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Move initialization of snd_recover into tcp_sendseqinit().
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doozy!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# f0ffb944	03-Sep-2001	Julian Elischer <julian@FreeBSD.org>	Patches from Keiichi SHIMA <keiichi@iij.ad.jp> to make ip use the standard protosw structure again. Obtained from: Well, KAME I guess.
# e7e2b801	29-Aug-2001	Jayanth Vijayaraghavan <jayanth@FreeBSD.org>	when newreno is turned on, if dupacks = 1 or dupacks = 2 and new data is acknowledged, reset the dupacks to 0. The problem was spotted when a connection had its send buffer full because the congestion window was only 1 MSS and was not being incremented because dupacks was not reset to 0. Obtained from: Yahoo!
# 745bab7f	23-Aug-2001	Dima Dorfman <dd@FreeBSD.org>	Correct a typo in a comment: FIN_WAIT2 -> FIN_WAIT_2 PR: 29970 Submitted by: Joseph Mallett <jmallett@xMach.org>
# b0e3ad75	21-Aug-2001	Mike Silbersack <silby@FreeBSD.org>	Much delayed but now present: RFC 1948 style sequence numbers In order to ensure security and functionality, RFC 1948 style initial sequence number generation has been implemented. Barring any major cryptographic breakthroughs, this algorithm should be unbreakable. In addition, the problems with TIME_WAIT recycling which affect our currently used algorithm are not present. Reviewed by: jesper
# 2d610a50	07-Jul-2001	Mike Silbersack <silby@FreeBSD.org>	Temporary feature: Runtime tuneable tcp initial sequence number generation scheme. Users may now select between the currently used OpenBSD algorithm and the older random positive increment method. While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT handling; this is causing trouble for an increasing number of folks. To switch between generation schemes, one sets the sysctl net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments, 1 = the OpenBSD algorithm. 1 is still the default. Once a secure _and_ compatible algorithm is implemented, this sysctl will be removed. Reviewed by: jlemon Tested by: numerous subscribers of -net
# c73d99b5	23-Jun-2001	Ruslan Ermilov <ru@FreeBSD.org>	Add netstat(1) knob to reset net.inet.{ip\|icmp\|tcp\|udp\|igmp}.stats. For example, ``netstat -s -p ip -z'' will show and reset IP stats. PR: bin/17338
# 08517d53	22-Jun-2001	Mike Silbersack <silby@FreeBSD.org>	Eliminate the allocation of a tcp template structure for each connection. The information contained in a tcptemp can be reconstructed from a tcpcb when needed. Previously, tcp templates required the allocation of one mbuf per connection. On large systems, this change should free up a large number of mbufs. Reviewed by: bmilekic, jlemon, ru MFC after: 2 weeks
# 33841545	10-Jun-2001	Hajimu UMEMOTO <ume@FreeBSD.org>	Sync with recent KAME. This work was based on kame-20010528-freebsd43-snap.tgz, and some critical problems found after the snap was out were fixed. There are many, many changes since the last KAME merge. TODO: - The definitions of SADB_* in sys/net/pfkeyv2.h are still different from RFC2407/IANA assignment because of a binary compatibility issue. It should be fixed under 5-CURRENT. - ip6po_m member of struct ip6_pktopts is no longer used. But, it is still there because of a binary compatibility issue. It should be removed under 5-CURRENT. Reviewed by: itojun Obtained from: KAME MFC after: 3 weeks
# 65f28919	06-Jun-2001	Jesper Skriver <jesper@FreeBSD.org>	Silby's take one on increasing FreeBSD's resistance to SYN floods: One way we can reduce the amount of traffic we send in response to a SYN flood is to eliminate the RST we send when removing a connection from the listen queue. Since we are being flooded, we can assume that the majority of connections in the queue are bogus. Our RST is unwanted by these hosts, just as our SYN-ACK was. Genuine connection attempts will result in hosts responding to our SYN-ACK with an ACK packet. We will automatically return a RST response to their ACK when it gets to us if the connection has been dropped, so the early RST doesn't serve the genuine class of connections much. In summary, we can reduce the number of packets we send by a factor of two without any loss in functionality by ensuring that RST packets are not sent when dropping a connection from the listen queue. Submitted by: Mike Silbersack <silby@silby.com> Reviewed by: jesper MFC after: 2 weeks
# e4b64281	29-May-2001	Jesper Skriver <jesper@FreeBSD.org>	Inline TCP_REASS() in the single location where it's used, just as OpenBSD and NetBSD have done. No functional difference. MFC after: 2 weeks
# 853be122	29-May-2001	Jesper Skriver <jesper@FreeBSD.org>	properly delay acks in half-closed TCP connections PR: 24962 Submitted by: Tony Finch <dot@dotat.at> MFC after: 2 weeks
# d1745f45	20-Apr-2001	Jesper Skriver <jesper@FreeBSD.org>	Say goodbye to TCP_COMPAT_42 Reviewed by: wollman Requested by: wollman
# f0a04f3f	17-Apr-2001	Kris Kennaway <kris@FreeBSD.org>	Randomize the TCP initial sequence numbers more thoroughly. Obtained from: OpenBSD Reviewed by: jesper, peter, -developers
# c59319bf	19-Mar-2001	Dag-Erling Smørgrav <des@FreeBSD.org>	Axe TCP_RESTRICT_RST. It was never a particularly good idea except for a few very specific scenarios, and now that we have had net.inet.tcp.blackhole for quite some time there is really no reason to use it any more. (last of three commits)
# d8c85a26	25-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Do not delay a new ack if there already is a delayed ack pending on the connection, but send it immediately. Prior to this change, it was possible to delay a delayed-ack for multiple times, resulting in degraded TCP behavior in certain corner cases.
# a57815ef	11-Feb-2001	Bosko Milekic <bmilekic@FreeBSD.org>	Clean up RST ratelimiting. Previously, ratelimiting occurred before tests were performed to determine if the received packet should be reset. This created erroneous ratelimiting and false alarms in some cases. The code has now been reorganized so that the checks for validity come before the call to badport_bandlim. Additionally, a few changes in the symbolic names of the bandlim types have been made, as well as a clarification of exactly which type each RST case falls under. Submitted by: Mike Silbersack <silby@silby.com>
# a589a70e	24-Jan-2001	Garrett Wollman <wollman@FreeBSD.org>	Correct a comment.
# 09f81a46	15-Dec-2000	Bosko Milekic <bmilekic@FreeBSD.org>	Change the following: 1. ICMP ECHO and TSTAMP replies are now rate limited. 2. RSTs generated due to packets sent to open and unopen ports are now limited by separate counters. 3. Each rate limiting queue now has its own description, as follows: Limiting icmp unreach response from 439 to 200 packets per second Limiting closed port RST response from 283 to 200 packets per second Limiting open port RST response from 18724 to 200 packets per second Limiting icmp ping response from 211 to 200 packets per second Limiting icmp tstamp response from 394 to 200 packets per second Submitted by: Mike Silbersack <silby@silby.com>
# 7cc0979f	08-Dec-2000	David Malone <dwmalone@FreeBSD.org>	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
# 8735719e	04-Nov-2000	Jonathan Lemon <jlemon@FreeBSD.org>	tp->snd_recover is part of the New Reno recovery algorithm, and should only be checked if the system is currently performing New Reno style fast recovery. However, this value was being checked regardless of the NR state, with the end result being that the congestion window was never opened. Change the logic to check t_dupack instead; the only code path that allows it to be nonzero at this point is NewReno, so if it is nonzero, we are in fast recovery mode and should not touch the congestion window. Tested by: phk
# e7f32693	21-Jul-2000	Jayanth Vijayaraghavan <jayanth@FreeBSD.org>	When a connection is being dropped due to a listen queue overflow, delete the cloned route that is associated with the connection. This does not exhaust the routing table memory when the system is under a SYN flood attack. The route entry is not deleted if there is any prior information cached in it. Reviewed by: Peter Wemm,asmodai
# b474779f	09-Jul-2000	Jun-ichiro itojun Hagino <itojun@FreeBSD.org>	be more cautious about tcp option length field. drop bogus ones earlier. not sure if there is a real threat or not, but it seems that there's possibility for overrun/underrun (like non-NOP option with optlen > cnt).
# 686cdd19	04-Jul-2000	Jun-ichiro itojun Hagino <itojun@FreeBSD.org>	sync with kame tree as of july00. tons of bug fixes/improvements. API changes: - additional IPv6 ioctls - IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8). (also syntax change)
# 4f14ee00	22-May-2000	Dan Moschuk <dan@FreeBSD.org>	sysctl'ize ICMP_BANDLIM and ICMP_BANDLIM_SUPPRESS_OUTPUT. Suggested by: des/nbm
# d8417274	18-May-2000	Jayanth Vijayaraghavan <jayanth@FreeBSD.org>	snd_cwnd was updated twice in the tcp_newreno function.
# 75c6e0e2	17-May-2000	Jayanth Vijayaraghavan <jayanth@FreeBSD.org>	Sigh, fix a rookie patch merge error. Also-missed-by: peter
# 6b2a5f92	15-May-2000	Jayanth Vijayaraghavan <jayanth@FreeBSD.org>	snd_una was being updated incorrectly, this resulted in the newreno code retransmitting data from the wrong offset. As a footnote, the newreno code was partially derived from NetBSD and Tom Henderson <tomh@cs.berkeley.edu>
# 46f58482	05-May-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Implement TCP NewReno, as documented in RFC 2582. This allows better recovery for multiple packet losses in a single window. The algorithm can be toggled via the sysctl net.inet.tcp.newreno, which defaults to "on". Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
# 5e0ab69d	17-Apr-2000	Munechika SUMIKAWA <sumikawa@FreeBSD.org>	ND6_HINT() should not be called unless the connection status is ESTABLISHED. Obtained from: KAME Project
# fdaf052e	01-Apr-2000	Yoshinobu Inoue <shin@FreeBSD.org>	Support per socket based IPv4 mapped IPv6 addr enable/disable control. Submitted by: ume
# db4f9cc7	27-Mar-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Add support for offloading IP/TCP/UDP checksums to NIC hardware which supports them.
# 4739b807	11-Mar-2000	Yoshinobu Inoue <shin@FreeBSD.org>	IPv6 6to4 support. The biggest problem with IPv6 today is getting an IPv6 address assignment; 6to4 solves that problem. A 6to4 addr is defined as follows: 2002: 4byte v4 addr : 2byte SLA ID : 8byte interface ID The most important point of the address format is that an IPv4 addr is embedded in it. So any user who has an IPv4 addr can get an IPv6 address block with 2byte subnet space. Also, the IPv4 addr is used for semi-automatic IPv6 over IPv4 tunneling. With 6to4, getting an IPv6 addr becomes dramatically easier. The attached patch enables the 6to4 extension, and was confirmed to work between "Richard Seaman, Jr." <dick@tar.com> and me. Approved by: jkh Reviewed by: itojun
# 173c0f9f	27-Jan-2000	Warner Losh <imp@FreeBSD.org>	Mitigate the stream.c attacks o Drop all broadcast and multicast source addresses in tcp_input. o Enable ICMP_BANDLIM in GENERIC. o Change default to 200/s from 100/s. This will still stop the attack, but is conservative enough to do this close to code freeze. This is not the optimal patch for the problem, but is likely the least intrusive patch that can be made for this. Obtained from: Don Lewis and Matt Dillon. Reviewed by: freebsd-security
# 69a34685	24-Jan-2000	Yoshinobu Inoue <shin@FreeBSD.org>	Avoid m_len and m_pkthdr.len inconsistency when changing m_len for an mbuf whose M_PKTHDR is set. PR: related to kern/15175 Reviewed by: archie
# 3a2a9f79	15-Jan-2000	Yoshinobu Inoue <shin@FreeBSD.org>	Fixed the problem that IPsec connection hangs when bigger data is sent. -opt_ipsec.h was missing on some tcp files (sorry for basic mistake) -made buildable as above fix -also added some missing IPv4 mapped IPv6 addr consideration into ipsec4_getpolicybysock
# 8972cdb1	12-Jan-2000	Yoshinobu Inoue <shin@FreeBSD.org>	add a comment for some possible? IPv4 option processing.
# fb59c426	09-Jan-2000	Yoshinobu Inoue <shin@FreeBSD.org>	tcp updates to support IPv6. also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 6a800098	22-Dec-1999	Yoshinobu Inoue <shin@FreeBSD.org>	IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# c0f7fd55	14-Dec-1999	Jonathan Lemon <jlemon@FreeBSD.org>	Use SEQ_* macros for comparing sequence space numbers. Reviewed by: truckman
# 1a244a61	10-Dec-1999	Jonathan Lemon <jlemon@FreeBSD.org>	According to RFC 793, a reset should be honored if the sequence number is within the receive window. Follow this behavior, instead of only allowing resets at last_ack_sent. Pointed out by: jayanth@yahoo-inc.com
# cfa1ca9d	07-Dec-1999	Yoshinobu Inoue <shin@FreeBSD.org>	udp IPv6 support, IPv6/IPv4 tunneling support in kernel, packet divert at kernel for IPv6/IPv4 translator daemon This includes queue related patch submitted by jburkhol@home.com. Submitted by: queue related patch from jburkhol@home.com Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# ecf72308	09-Oct-1999	Brian Feldman <green@FreeBSD.org>	Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total usage limit.
# f8613305	14-Sep-1999	Dag-Erling Smørgrav <des@FreeBSD.org>	Fix some more disordering, as well as the description string for the net.inet.tcp.drop_synfin sysctl, which for some mysterious reason said "Drop TCP packets with FIN+ACK set" (instead of "...with SYN+FIN set")
# e46cd3d4	12-Sep-1999	Dag-Erling Smørgrav <des@FreeBSD.org>	Add the net.inet.tcp.restrict_rst and net.inet.tcp.drop_synfin sysctl variables, conditional on the TCP_RESTRICT_RST and TCP_DROP_SYNFIN kernel options, respectively. See the comments in LINT for details.
# 9b8b58e0	30-Aug-1999	Jonathan Lemon <jlemon@FreeBSD.org>	Restructure TCP timeout handling: - eliminate the fast/slow timeout lists for TCP and instead use a callout entry for each timer. - increase the TCP timer granularity to HZ - implement "bad retransmit" recovery, as presented in "On Estimating End-to-End Network Path Properties", by Allman and Paxson. Submitted by: jlemon, wollmann
# 5a8c77a8	29-Aug-1999	David E. O'Brien <obrien@FreeBSD.org>	Remove extra indenting of `break' statements introduced in rev 1.89, plus wrap some long lines from that revision. While here, wrap some other long lines.
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# 828b7f40	18-Aug-1999	Geoff Rehmet <csgr@FreeBSD.org>	Fix breakage if blackhole=1 and tiflags & TH_SYN, plus style(9) fixes Submitted by: Jonathan Lemon
# 2e4e1b4c	18-Aug-1999	Geoff Rehmet <csgr@FreeBSD.org>	Slight tweak to tcp.blackhole to add optional behaviour to drop any segment arriving at a closed port. tcp.blackhole=1 - only drop SYN without RST tcp.blackhole=2 - drop everything without RST tcp.blackhole=0 - always send RST - default behaviour This confuses nmap -sF or -sX or -sN quite badly.
# 16f7f31f	16-Aug-1999	Geoff Rehmet <csgr@FreeBSD.org>	Add net.inet.tcp.blackhole and net.inet.udp.blackhole sysctl knobs. With these knobs on, refused connection attempts are dropped without sending a RST, or Port unreachable in the UDP case. In the TCP case, sending of RST is inhibited iff the incoming segment was a SYN. Docs and rc.conf settings to follow.
# e9bd3a37	18-Jul-1999	Jonathan M. Bresler <jmb@FreeBSD.org>	fix comment re: RST received in TIME_WAIT to match the code.
# dfd5dee1	06-May-1999	Peter Wemm <peter@FreeBSD.org>	Add sufficient braces to keep egcs happy about potentially ambiguous if/else nesting.
# 3d177f46	03-May-1999	Bill Fumerola <billf@FreeBSD.org>	Add sysctl descriptions to many SYSCTL_XXXs PR: kern/11197 Submitted by: Adrian Chadd <adrian@FreeBSD.org> Reviewed by: billf(spelling/style/minor nits) Looked at by: bde(style)
# 51b7b337	05-Feb-1999	Bill Fenner <fenner@FreeBSD.org>	Use snd_nxt, not rcv_nxt, when calculating the ISS during TIME_WAIT. This was missed in the 4.4-Lite2 merge. Noticed by: Mohan Parthasarathy <Mohan.Parthasarathy@eng.Sun.COM> and jayanth@loc201.tandem.com (vijayaraghavan_jayanth) on the tcp-impl mailing list.
# 831a80b0	27-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 51508de1	03-Dec-1998	Matthew Dillon <dillon@FreeBSD.org>	Reviewed by: freebsd-current Add ICMP_BANDLIM option and 'net.inet.icmp.icmplim' sysctl. If option is specified in kernel config, icmplim defaults to 100 pps. Setting it to 0 will disable the feature. This feature limits ICMP error responses for packets sent to bad tcp or udp ports, which does a lot to help the machine handle network D.O.S. attacks. The kernel will report packet rates that exceed the limit at a rate of one kernel printf per second. There is one issue in regards to the 'tail end' of an attack... the kernel will not output the last report until some unrelated and valid icmp error packet is returned at some point after the attack is over. This is a minor reporting issue only.
# 80ab7c0e	11-Sep-1998	Garrett Wollman <wollman@FreeBSD.org>	Fix RST validation. PR: 7892 Submitted by: Don.Lewis@tsc.tdk.com
# 6effc713	24-Aug-1998	Doug Rabson <dfr@FreeBSD.org>	Re-implement tcp and ip fragment reassembly to not store pointers in the ip header which can't work on alpha since pointers are too big. Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
# f9e354df	05-Jul-1998	Julian Elischer <julian@FreeBSD.org>	Support for IPFW based transparent forwarding. Any packet that can be matched by a ipfw rule can be redirected transparently to another port or machine. Redirection to another port mostly makes sense with tcp, where a session can be set up between a proxy and an unsuspecting client. Redirection to another machine requires that the other machine also be expecting to receive the forwarded packets, as their headers will not have been modified. /sbin/ipfw must be recompiled!!! Reviewed by: Peter Wemm <peter@freebsd.org> Submitted by: Chrisy Luke <chrisy@flix.net>
# 04a3fd12	31-May-1998	Peter Wemm <peter@FreeBSD.org>	Let the sowwakeup macro decide when to call sowakeup rather than have tcp "know" about it. A pending upcall would be missed, eg: used by NFS. Obtained from: NetBSD
# 068373b6	18-May-1998	Guido van Rooij <guido@FreeBSD.org>	Grumble...It seems I'm suffering from some mental disease. Do it correct now.
# 0bce271a	18-May-1998	Guido van Rooij <guido@FreeBSD.org>	Add some parentheses for clarity and fix a bug Pointed out by: Garrett Wollman
# 11ad4550	04-May-1998	Guido van Rooij <guido@FreeBSD.org>	Refuse accelerated opens on listening sockets that have not set the TCP_NOPUSH socket option. This disables TAO for those services that do not know about T/TCP. Reviewed by: Garrett Wollman Submitted by: Peter Wemm
# 84e33c9e	24-Apr-1998	David Greenman <dg@FreeBSD.org>	At the request of Garrett, changed sysctl: net.inet.tcp.delack_enabled -> net.inet.tcp.delayed_ack
# dc733423	17-Apr-1998	Dag-Erling Smørgrav <des@FreeBSD.org>	Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.
# 8e5db87c	06-Apr-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Remove the last traces of TUBA. Inspired by: PR kern/3317
# 75daa6a5	19-Mar-1998	Bill Fenner <fenner@FreeBSD.org>	Remove the check for SYN in SYN_RECEIVED state; it breaks simultaneous connect. This check was added as part of the defense against the "land" attack, to prevent attacks which guess the ISS from going into ESTABLISHED. The "src == dst" check will still prevent the single-homed case of the "land" attack, and guessing ISS's should be hard anyway. Submitted by: David Borman <dab@bsdi.com>
# f498eeee	25-Feb-1998	David Greenman <dg@FreeBSD.org>	Changes to support the addition of a new sysctl variable: net.inet.tcp.delack_enabled Which defaults to 1 and can be set to 0 to disable TCP delayed-ack processing (i.e. all acks are immediate).
# c3229e05	27-Jan-1998	David Greenman <dg@FreeBSD.org>	Improved connection establishment performance by doing local port lookups via a hashed port list. In the new scheme, in_pcblookup() goes away and is replaced by a new routine, in_pcblookup_local() for doing the local port check. Note that this implementation is space inefficient in that the PCB struct is now too large to fit into 128 bytes. I might deal with this in the future by using the new zone allocator, but I wanted these changes to be extensively tested in their current form first. Also: 1) Fixed off-by-one errors in the port lookup loops in in_pcbbind(). 2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash() to do the initial hash insertion. 3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability. 4) Added a new routine, in_pcbremlists() to remove the PCB from the various hash lists. 5) Added/deleted comments where appropriate. 6) Removed unnecessary splnet() locking. In general, the PCB functions should be called at splnet()...there are unfortunately a few exceptions, however. 7) Reorganized a few structs for better cache line behavior. 8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in the future, however. These changes have been tested on wcarchive for more than a month. In tests done here, connection establishment overhead is reduced by more than 50 times, thus getting rid of one of the major networking scalability problems. Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult. WARNING: Anything that knows about inpcb and tcpcb structs will have to be recompiled; at the very least, this includes netstat(1).
# 764d8cef	20-Jan-1998	Bill Fenner <fenner@FreeBSD.org>	A more complete fix for the "land" attack, removing the "quick fix" from rev 1.66. This fix contains both belt and suspenders. Belt: ignore packets where src == dst and srcport == dstport in TCPS_LISTEN. These packets can only legitimately occur when connecting a socket to itself, which doesn't go through TCPS_LISTEN (it goes CLOSED->SYN_SENT->SYN_RCVD-> ESTABLISHED). This prevents the "standard" "land" attack, although doesn't prevent the multi-homed variation. Suspenders: send a RST in response to a SYN/ACK in SYN_RECEIVED state. The only packets we should get in SYN_RECEIVED are 1. A retransmitted SYN, or 2. An ack of our SYN/ACK. The "land" attack depends on us accepting our own SYN/ACK as an ACK; in SYN_RECEIVED state; this should prevent all "land" attacks. We also move up the sequence number check for the ACK in SYN_RECEIVED. This neither helps nor hurts with respect to the "land" attack, but puts more of the validation checking in one spot. PR: kern/5103
# 592071e8	19-Dec-1997	Bruce Evans <bde@FreeBSD.org>	Don't use ANSI string concatenation to misformat a string.
# 76d3eadb	20-Nov-1997	Garrett Wollman <wollman@FreeBSD.org>	Add Matt Dillon's quick fix hack for the self-connect DoS. PR: 5103
# 4a11ca4e	07-Nov-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused
# 55b211e3	28-Oct-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# 4281faf2	01-Oct-1997	David Greenman <dg@FreeBSD.org>	Killed the SYN_RECEIVED addition from rev 1.52. It results in legitimate RST's being ignored, keeping a connection around until it times out, and thus has the opposite effect of what was intended (which is to make the system more robust to DoS attacks).
# 026650e5	30-Sep-1997	Bill Fenner <fenner@FreeBSD.org>	Don't consider a SYN/ACK with CC but no CCECHO a proper T/TCP handshake. Reviewed by: Rich Stevens <rstevens@kohala.com>
# 0cc12cc5	16-Sep-1997	Joerg Wunsch <joerg@FreeBSD.org>	Make TCPDEBUG a new-style option.
# 57bf258e	16-Aug-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to not malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# 66e39adc	30-Jun-1997	John Polstra <jdp@FreeBSD.org>	Fix a bug (apparently very old) that can cause a TCP connection to be dropped when it has an unusual traffic pattern. For full details as well as a test case that demonstrates the failure, see the referenced PR. Under certain circumstances involving the persist state, it is possible for the receive side's tp->rcv_nxt to advance beyond its tp->rcv_adv. This causes (tp->rcv_adv - tp->rcv_nxt) to become negative. However, in the code affected by this fix, that difference was interpreted as an unsigned number by max(). Since it was negative, it was taken as a huge unsigned number. The effect was to cause the receiver to believe that its receive window had negative size, thereby rejecting all received segments including ACKs. As the test case shows, this led to fruitless retransmissions and eventually to a dropped connection. Even connections using the loopback interface could be dropped. The fix substitutes the signed imax() for the unsigned max() function. PR: closes kern/3998 Reviewed by: davidg, fenner, wollman
# a29f300e	27-Apr-1997	Garrett Wollman <wollman@FreeBSD.org>	The long-awaited mega-massive-network-code-cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so that they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 39172c94	10-Nov-1996	Bill Fenner <fenner@FreeBSD.org>	Re-enable the TCP SYN-attack protection code. I was the one who didn't understand the socket state flag. 2.2 candidate.
# a51764a8	11-Oct-1996	Paul Traina <pst@FreeBSD.org>	Fix two bugs I accidentally put into the syn code at the last minute (yes I had tested the hell out of this). I've also temporarily disabled the code so that it behaves as it previously did (tail-drops the SYNs) pending discussion with fenner about some socket state flags that I don't fully understand. Submitted by: fenner
# 6d6a026b	07-Oct-1996	David Greenman <dg@FreeBSD.org>	Improved in_pcblookuphash() to support wildcarding, and changed relevant callers of it to take advantage of this. This reduces new connection request overhead in the face of a large number of PCBs in the system. Thanks to David Filo <filo@yahoo.com> for suggesting this and providing a sample implementation (which wasn't used, but showed that it could be done). Reviewed by: wollman
# ebb0cbea	06-Oct-1996	Paul Traina <pst@FreeBSD.org>	Increase robustness of FreeBSD against high-rate connection attempt denial of service attacks. Reviewed by: bde,wollman,olah Inspired by: vjs@sgi.com
# 12eafeb0	21-Sep-1996	Paul Traina <pst@FreeBSD.org>	I don't understand, I committed this fix (move a counter and fixed a typo) this evening. I think I'm going insane.
# 96d719e6	21-Sep-1996	Andrey A. Chernov <ache@FreeBSD.org>	Syntax error: so_incom -> so_incomp
# 4195b4af	20-Sep-1996	Paul Traina <pst@FreeBSD.org>	If the incomplete listen queue for a given socket is full, drop the oldest entry in the queue. There was a fair bit of discussion as to whether or not the proper action is to drop a random entry in the queue. It's my conclusion that a random drop is better than a head drop, however profiling this section of code (done by John Capo) shows that a head-drop results in a significant performance increase. There are scenarios where a random drop is more appropriate. If I find one in reality, I'll add the random drop code under a conditional. Obtained from: discussions and code done by Vernon Schryver (vjs@sgi.com).
# 7b40aa32	13-Sep-1996	Paul Traina <pst@FreeBSD.org>	Make the misnamed tcp initial keepalive timer value (which is really the time, in seconds, that state for non-established TCP sessions stays about) a sysctl-modifiable variable. [part 1 of two commits, I just realized I can't play with the indices as I was typing this commit message.]
# 7ff19458	13-Sep-1996	Paul Traina <pst@FreeBSD.org>	Receipt of two SYN's are sufficient to set the t_timer[TCPT_KEEP] to "keepidle". this should not occur unless the connection has been established via the 3-way handshake which requires an ACK Submitted by: jmb Obtained from: problem discussed in Stevens vol. 3
# df5c0b8a	01-May-1996	Bill Fenner <fenner@FreeBSD.org>	Back out my stupid braino; I was thinking strlen and not sizeof.
# af00f800	01-May-1996	Bill Fenner <fenner@FreeBSD.org>	Size temp var correctly; buf[4*sizeof "123"] is not long enough to store "192.252.119.189\0".
# 75cfc95f	27-Apr-1996	Andrey A. Chernov <ache@FreeBSD.org>	inet_ntoa buffer was evaluated twice in log_in_vain, fix it. Thanx to: jdp
# a2352fc1	26-Apr-1996	Garrett Wollman <wollman@FreeBSD.org>	Delete #ifdef notdef blocks containing old method of srtt calculation. Requested by: davidg
# d78a37ad	09-Apr-1996	Paul Traina <pst@FreeBSD.org>	Logging UDP and TCP connection attempts should not be enabled by default. It's trivial to create a denial of service attack on a box so enabled. These messages, if enabled at all, must be rate-limited. (!)
# 816a3d83	04-Apr-1996	Poul-Henning Kamp <phk@FreeBSD.org>	Log TCP syn packets for ports we don't listen on. Controlled by: sysctl net.inet.tcp.log_in_vain: 1 Log UDP syn packets for ports we don't listen on. Controlled by: sysctl net.inet.udp.log_in_vain: 1 Suggested by: Warren Toomey <wkt@cs.adfa.oz.au>
# 9e2874b0	25-Mar-1996	Garrett Wollman <wollman@FreeBSD.org>	Slight modification of RTO floor calculation.
# 233e8c18	22-Mar-1996	Garrett Wollman <wollman@FreeBSD.org>	A number of performance-reducing flaws fixed based on comments from Larry Peterson &co. at Arizona: - Header prediction for ACKs did not exclude Fast Retransmit/Recovery. - srtt calculation tended to get ``stuck'' and could never decrease when below 8. It still can't, but the scaling factors are adjusted so that this artifact does not cause as bad an effect on the RTO value as it used to. The paper also points out the incr/8 error that has been long since fixed, and the problems with ACKing frequency resulting from the use of options which I suspect to be fixed already as well (as part of the T/TCP work). Obtained from: Brakmo & Peterson, ``Performance Problems in BSD4.4 TCP''
# 2ee45d7d	11-Mar-1996	David Greenman <dg@FreeBSD.org>	Move or add #include <queue.h> in preparation for upcoming struct socket changes.
# 1347f5b8	26-Feb-1996	Guido van Rooij <guido@FreeBSD.org>	Add a counter for the number of times the listen queue was overflowed to the tcpstat structure. (netstat -s) Reviewed by: wollman Obtained from: Stevens, TCP/IP Illustrated, vol. 3, page 189
# f9d5a964	22-Feb-1996	David Greenman <dg@FreeBSD.org>	Fixed bug in Path MTU Discovery that caused the system to have to rediscover the Path MTU for each connection if the connecting host didn't offer an initial MSS. Submitted by: davidg & olah
# 07e43e10	31-Jan-1996	Andras Olah <olah@FreeBSD.org>	Fix a bug related to the interworking of T/TCP and window scaling: when a connection enters the ESTBLS state using T/TCP, then window scaling wasn't properly handled. The fix is twofold. 1) When the 3WHS completes, make sure that we update our window scaling state variables. 2) When setting the `virtual advertized window', then make sure that we do not try to offer a window that is larger than the maximum window without scaling (TCP_MAXWIN). Reviewed by: davidg Reported by: Jerry Chen <chen@Ipsilon.COM>
# f708ef1b	14-Dec-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Another mega commit to staticize things.
# 0312fbe9	14-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	New style sysctl & staticize a lot of stuff.
# 98163b98	09-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Start adding new style sysctl here too.
# 845799c1	03-Nov-1995	Andras Olah <olah@FreeBSD.org>	Cosmetic changes to processing of segments in the SYN_SENT state: - remove a redundant condition; - complete all validity checks on segment before calling soisconnected(so). Reviewed by: Richard Stevens, davidg, wollman
# 91badc86	13-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	Routes can be asymmetric. Always offer to /accept/ an MSS of up to the capacity of the link, even if the route's MTU indicates that we cannot send that much in their direction. (This might actually make it possible to test Path MTU discovery in a useful variety of cases.)
# e79adb8e	03-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	Finish 4.4-Lite-2 merge: randomize TCP initial sequence numbers to make ISS-guessing spoofing attacks harder.
# efe4b0eb	21-Sep-1995	Garrett Wollman <wollman@FreeBSD.org>	Second try: get 4.4-Lite-2 into the source tree. The conflicts don't matter because none of our working source files are on the CSRG branch any more. Obtained from: 4.4BSD-Lite-2
# d3eede9d	31-Jul-1995	Andras Olah <olah@FreeBSD.org>	Remove a redundant `if' from tcp_reass(). Correct a typo in a comment (SEND_SYN -> NEEDSYN). Reviewed by: David Greenman
# dd224982	10-Jul-1995	Garrett Wollman <wollman@FreeBSD.org>	tcp_input.c - keep track of how many times a route contained a cached rtt or ssthresh that we were able to use tcp_var.h - declare tcpstat entries for above; declare tcp_{send,recv}space in_rmx.c - fill in the MTU and pipe sizes with the defaults TCP would have used anyway in the absence of values here
# fc978271	29-Jun-1995	Garrett Wollman <wollman@FreeBSD.org>	Keep track of the number of samples through the srtt filter so that we know better when to cache values in the route, rather than relying on a heuristic involving sequence numbers that broke when tcp_sendspace was increased to 16k.
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# 6b067b07	10-May-1995	David Greenman <dg@FreeBSD.org>	#ifdef'd my Nagle/ACK hack with "TCP_ACK_HACK", disabled by default. I'm currently considering reducing the TCP fasttimo to 100ms to help improve things, but this would be done as a separate step at some point in the future. This was done because it was causing some sometimes serious performance problems with T/TCP.
# 40db8ef7	08-May-1995	Andras Olah <olah@FreeBSD.org>	Fix a misspelled constant in tcp_input.c. On Tue, 09 May 1995 04:35:27 PDT, Richard Stevens wrote: > In tcp_dooptions() under the case TCPOPT_CC there is an assignment > > to->to_flag |= TCPOPT_CC; > > that should be > > to->to_flag |= TOF_CC; > > I haven't thought through the ramifications of what's been happening ... > > Rich Stevens Submitted by: rstevens@noao.edu (Richard Stevens)
# 0d7b7d3e	03-May-1995	David Greenman <dg@FreeBSD.org>	Changed in_pcblookuphash() to not automatically call in_pcblookup() if the lookup fails. Updated callers to deal with this. Call in_pcblookuphash instead of in_pcblookup() in in_pcbconnect; this improves performance of UDP output by about 17% in the standard case.
# d79940da	10-Apr-1995	David Greenman <dg@FreeBSD.org>	Further satisfy my paranoia by making sure that the ACKNOW is only set when ti_len is non-zero.
# afa70c96	10-Apr-1995	David Greenman <dg@FreeBSD.org>	Fixed bug I introduced with my Nagle hack which caused tcp_input and tcp_output to loop endlessly. This was freefall's problem during the past day.
# 15bd2b43	08-Apr-1995	David Greenman <dg@FreeBSD.org>	Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash, and in_pcblookuphash.
# 755c1f07	05-Apr-1995	Andras Olah <olah@FreeBSD.org>	Fix a bug in tcp_input reported by Rick Jones <raj@hpisrdq.cup.hp.com>. If a goto findpcb occurred during the processing of a segment, the TCP and IP headers were dropped twice from the mbuf which resulted in data acked by TCP but not delivered to the user. Reviewed by: davidg
# e612a582	27-Mar-1995	David Greenman <dg@FreeBSD.org>	Re-apply my "breakage" to the Nagle congestion avoidance. This version differs slightly in the logic from the previous version; packets are now acked immediately if the sender set PUSH.
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# dac20301	15-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Avoid deadlock situation described by Stevens using his suggested replacement code. Obtained from: Stevens, vol. 2, pp. 959-960
# 41f82abe	15-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Transaction TCP support now standard. Hack away!
# c70f4510	13-Feb-1995	Poul-Henning Kamp <phk@FreeBSD.org>	YFfix.
# 2f96f1f4	13-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Get rid of some unneeded #ifdef TTCP lines. Also, get rid of some bogus commons declared in header files.
# a0292f23	09-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and Bob Braden <braden@isi.edu>. NB: This has not had David's TCP ACK hack re-integrated. It is not clear what the correct solution to this problem is, if any. If a better solution doesn't pop up in response to this message, I'll put David's code back in (or he's welcome to do so himself).
# 10be5648	13-Oct-1994	Garrett Wollman <wollman@FreeBSD.org>	As suggested by Sally Floyd, don't add the ``small fraction of the window size'' when doing congestion avoidance. Submitted by: Mark Andrews
# 623ae52e	02-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	GCC cleanup.
# 610ee2f9	15-Sep-1994	David Greenman <dg@FreeBSD.org>	Made TCPDEBUG truly optional. Based on changes I made in FreeBSD 1.1.5. Fixed somebody's idea of a joke - about the first half of the lines in in_proto.c were spaced over by one space.
# 6a7be6e8	26-Aug-1994	Garrett Wollman <wollman@FreeBSD.org>	Obey RFC 793, section 3.4: Several examples of connection initiation follow. Although these examples do not show connection synchronization using data-carrying segments, this is perfectly legitimate, so long as the receiving TCP doesn't deliver the data to the user until it is clear the data is valid (i.e., the data must be buffered at the receiver until the connection reaches the ESTABLISHED state).
# f23b4c91	18-Aug-1994	Garrett Wollman <wollman@FreeBSD.org>	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# b164106f	31-Jul-1994	David Greenman <dg@FreeBSD.org>	Fixed bug with Nagle Congestion Avoidance where a tcp connection would stall unnecessarily - always send an ACK when a packet len of < mss is received.
# d4d0967e	26-May-1994	David Greenman <dg@FreeBSD.org>	Added missing ntohl()'s that are needed before calling IN_MULTICAST in a couple of places. Submitted by: Johannes Helander
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources
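
Note on revision 66e39adc (30-Jun-1997): that entry describes tp->rcv_nxt advancing past tp->rcv_adv, so that the difference (tp->rcv_adv - tp->rcv_nxt) goes negative and is then misread as a huge unsigned number by max(). The following is a minimal sketch of that signed/unsigned pitfall, not the kernel code itself. The helper names (max_u32, max_i32, window_buggy, window_fixed) and the fixed-width types are assumptions made for illustration; they stand in for the kernel's max() and imax().

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for an unsigned max(); not the kernel helper. */
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Illustrative stand-in for a signed imax(); not the kernel helper. */
static int32_t max_i32(int32_t a, int32_t b) { return a > b ? a : b; }

/*
 * Buggy variant: the possibly negative difference is compared as unsigned,
 * so a small negative value turns into a number close to 2^32 and wins.
 */
static uint32_t window_buggy(int32_t space, uint32_t rcv_adv, uint32_t rcv_nxt)
{
    int32_t advertised = (int32_t)(rcv_adv - rcv_nxt);
    return max_u32((uint32_t)space, (uint32_t)advertised);
}

/*
 * Fixed variant: the comparison is done on signed values, so a negative
 * difference loses to the available buffer space.
 */
static uint32_t window_fixed(int32_t space, uint32_t rcv_adv, uint32_t rcv_nxt)
{
    int32_t advertised = (int32_t)(rcv_adv - rcv_nxt);
    return (uint32_t)max_i32(space, advertised);
}
```

With 8192 bytes of buffer space, rcv_adv = 1000 and rcv_nxt = 1500, the difference is -500: window_buggy() returns 4294966796 while window_fixed() returns 8192. When rcv_adv is ahead of rcv_nxt, both variants agree.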