#
fce03f85 |
|
05-May-2024 |
Randall Stewart <rrs@FreeBSD.org> |
TCP can be subject to Sack Attacks lets fix this issue. There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like: ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704) This first sack looks fine but then the attacker sends ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705) ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707) ... These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries. To combat this we introduce three things. when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing. Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks. The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway. The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks). After this set of changes is in rack can drop its SAD detection completely Reviewed by:tuexen@, rscheff@ Differential Revision: <https://reviews.freebsd.org/D44903>
|
#
c9cd686b |
|
18-Apr-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: drop data received after a FIN has been processed RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING, LAST-ACK, and TIME-WAIT states: This should not occur since a FIN has been received from the remote side. Ignore the segment text. Therefore, implement this handling. Reviewed by: rrs, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44746
|
#
605a0066 |
|
15-Apr-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp bbr: improve code consistency Improve code consistency with the RACK stack. Reviewed by: gallatin, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44800
|
#
dd7b86e2 |
|
18-Mar-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove IS_FASTOPEN() macro The macro is more obfuscating than helping as it just checks a single flag of t_flags. All other t_flags bits are checked without a macro. A bigger problem was that declaration of the macro in tcp_var.h depended on a kernel option. It is a bad practice to create such definitions in installable headers. Reviewed by: rscheff, tuexen, kib Differential Revision: https://reviews.freebsd.org/D44362
|
#
e18b97bd |
|
12-Mar-2024 |
Randall Stewart <rrs@FreeBSD.org> |
Update to bring the rack stack with all its fixes in. This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output. It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements. With a bit of a struggle with git I finally got rack_pcm.c into place (apologies for not noticing this error). The LINT kernel is running on my box now .. sigh. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
|
#
c112243f |
|
11-Mar-2024 |
Brooks Davis <brooks@FreeBSD.org> |
Revert "Update to bring the rack stack with all its fixes in." This commit was incomplete and breaks LINT kernels. The tree has been broken for 8+ hours. This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.
|
#
f6d489f4 |
|
11-Mar-2024 |
Randall Stewart <rrs@FreeBSD.org> |
Update to bring the rack stack with all its fixes in. This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output. Note there is a new file that I can't figure out how to get in rack_pcm.c It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
|
#
2f4e46df |
|
15-Feb-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
RACK, BBR: handle EACCES like EPERM for IP output handling The FreeBSD TCP base stack handles them also the same way. In case of packet filters dropping packets in the output path, this avoids retranmitting the dropped packet every 10ms or so. Reviewed by: rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D43773
|
#
7b0b448b |
|
27-Dec-2023 |
Gordon Bergling <gbe@FreeBSD.org> |
tcp_stacks: Fix two typos in a source code comments - s/recieved/received/ MFC after: 3 days
|
#
d2ef52ef |
|
04-Dec-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp/hpts: make stacks responsible for clearing themselves out HPTS There already is the tfb_tcp_timer_stop_all method that is supposed to stop all time events associated with a given tcpcb by given stack. Some time ago it was doing actual callout_stop(). Today bbr/rack just mark their internal state as inactive in their tfb_tcp_timer_stop_all methods, but tcpcb stays in HPTS wheel and potentially called in from HPTS. Change the methods to also call tcp_hpts_remove(). Note: I'm not sure if internal flag is still relevant once we are out of HPTS wheel. Call the method when connection goes into TCP_CLOSED state, instead of calling it later when tcpcb is freed. Also call it when we switch between stacks. Reviewed by: tuexen, rrs Differential Revision: https://reviews.freebsd.org/D42857
|
#
2b3a7746 |
|
04-Dec-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
hpts: make stacks responsible for tcp_hpts_init() Those stacks that use HPTS should care about init, not generic code. Reviewed by: imp, tuexen, rrs Differential Revision: https://reviews.freebsd.org/D42856
|
#
685dc743 |
|
16-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
|
#
ab65c64b |
|
26-Jul-2023 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: fix handling of <RST,ACK> segments in SYN-RCVD for RACK and BBR This deals with TCP endpoints in the SYN-RCVD state coming from the SYN-SENT state. Reviewed by: rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D41203
|
#
02b885b0 |
|
21-Jun-2023 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: fix TCP MD5 computation for the BBR and RACK stack PR: 253096 Reviewed by: cc, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D40597
|
#
7ea8d027 |
|
20-Jun-2023 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
Update various sys/netinet source files to conform with the style(9) guide on how to label FALLTHOUGH in switch statements. No functional chance. Reviewed By: tuexen, cc, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D40622
|
#
43b117f8 |
|
06-Jun-2023 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
tcp: make the maximum number of retransmissions tunable per VNET Both Windows (TcpMaxDataRetransmissions) and Linux (tcp_retries2) allow to restrict the maximum number of consecutive timer based retransmissions. Add that same capability on a per-VNet basis to FreeBSD. Reviewed By: cc, tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D40424
|
#
c3c20de3 |
|
25-Apr-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: move HPTS/LRO flags out of inpcb to tcpcb These flags are TCP specific. While here, make also several LRO internal functions to pass tcpcb pointer instead of inpcb one. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39698
|
#
c2a69e84 |
|
25-Apr-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_hpts: move HPTS related fields from inpcb to tcpcb This makes inpcb lighter and allows future cache line optimizations of tcpcb. The reason why HPTS originally used inpcb is the compressed TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the associated connection is still on the HPTS ring. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39697
|
#
2ad584c5 |
|
17-Apr-2023 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: Inconsistent use of hpts_calling flag Gleb has noticed there were some inconsistency's in the way the inp_hpts_calls flag was being used. One such inconsistency results in a bug when we can't allocate enough sendmap entries to entertain a call to rack_output().. basically a timer won't get started like it should. Also in cleaning this up I find that the "no_output" side of input needs to be adjusted to make sure we don't try to re-pace too quickly outside the hpts assurance of 250useconds. Another thing here is we end up with duplicate calls to tcp_output() which we should not. If packets go from hpts for processing the input side of tcp will call the output side of tcp on the last packet if it is needed. This means that when that occurs a second call to tcp_output would be made that is not needed and if pacing is going on may be harmful. Lets fix all this and explicitly state the contract that hpts is making with transports that care about the flag. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D39653
|
#
a540cdca |
|
17-Apr-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_hpts: use queue(9) STAILQ for the input queue Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39574
|
#
3cc7b667 |
|
14-Apr-2023 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: stack unloading crash in rack and bbr Its possible to induce a crash in either rack or bbr. This would be done if the rack stack were say the default and bbr was being used by a connection. If the bbr stack is then unloaded and it was active, we will trigger a MPASS assert in tcp_hpts since the new stack (default rack) would start a timer, and the old stack (bbr) would have the inp already in hpts. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D39576
|
#
66fbc19f |
|
07-Apr-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: pass tcpcb in the tfb_tcp_ctloutput() method instead of inpcb Just matches rest of the KPI. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39435
|
#
35bc0bcc |
|
07-Apr-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: reduce argument list to functions that pass a segment The socket argument is superfluous, as a tcpcb always has one and only one socket. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39434
|
#
945f9a7c |
|
07-Apr-2023 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: misc cleanup of options for rack as well as socket option logging. Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also there is a t_maxpeak_rate that needs to be removed (its un-used). Reviewed by: tuexen, cc Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39427
|
#
73ee5756 |
|
31-Mar-2023 |
Randall Stewart <rrs@FreeBSD.org> |
Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features. So stack switching as always been a bit of a issue. We currently use a break before make setup which means that if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider for sure). Once all this lands I will then update rack to begin using all these new features. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39210
|
#
69c7c811 |
|
16-Mar-2023 |
Randall Stewart <rrs@FreeBSD.org> |
Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities. The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints we need to move to using the new inline functions. This adds them and moves rack to now use the tcp_tracepoints. Reviewed by: tuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D38831
|
#
c0e4090e |
|
08-Feb-2023 |
Andrew Gallatin <gallatin@FreeBSD.org> |
ktls: Accurately track if ifnet ktls is enabled This allows us to avoid spurious calls to ktls_disable_ifnet() When we implemented ifnet kTLSe, we set a flag in the tx socket buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that now, or in the past, ifnet ktls was active on a socket. Later, I added code to switch ifnet ktls sessions to software in the case of lossy TCP connections that have a high retransmit rate. Because TCP was using SB_TLS_IFNET to know if it needed to do math to calculate the retransmit ratio and potentially call into ktls_disable_ifnet(), it was doing unneeded work long after a session was moved to software. This patch carefully tracks whether or not ifnet ktls is still enabled on a TCP connection. Because the inp is now embedded in the tcpcb, and because TCP is the most frequent accessor of this state, it made sense to move this from the socket buffer flags to the tcpcb. Because we now need reliable access to the tcbcb, we take a ref on the inp when creating a tx ktls session. While here, I noticed that rack/bbr were incorrectly implementing tfb_hwtls_change(), and applying the change to all pending sends, when it should apply only to future sends. This change reduces spurious calls to ktls_disable_ifnet() by 95% or so in a Netflix CDN environment. Reviewed by: markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D38380
|
#
18b83b62 |
|
26-Jan-2023 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
tcp: reduce the size of t_rttupdated in tcpcb During tcp session start, various mechanisms need to track a few initial RTTs before becoming active. Prevent overflows of the corresponding tracking counter and reduce the size of tcpcb simultaneously. Reviewed By: #transport, tuexen, guest-ccui Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D21117
|
#
73e994a9 |
|
19-Jan-2023 |
Gordon Bergling <gbe@FreeBSD.org> |
extra_tcp_stacks: Fix a common typo in source code comments - s/orginal/original/ MFC after: 3 days
|
#
eaabc937 |
|
14-Dec-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: retire TCPDEBUG This subsystem is superseded by modern debugging facilities, e.g. DTrace probes and TCP black box logging. We intentionally leave SO_DEBUG in place, as many utilities may set it on a socket. Also the tcp::debug DTrace probes look at this flag on a socket. Reviewed by: gnn, tuexen Discussed with: rscheff, rrs, jtl Differential revision: https://reviews.freebsd.org/D37694
|
#
446ccdd0 |
|
07-Dec-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: use single locked callout per tcpcb for the TCP timers Use only one callout structure per tcpcb that is responsible for handling all five TCP timeouts. Use locked version of callout, of course. The callout function tcp_timer_enter() chooses soonest timer and executes it with lock held. Unless the timer reports that the tcpcb has been freed, the callout is rescheduled for next soonest timer, if there is any. With single callout per tcpcb on connection teardown we should be able to fully stop the callout and immediately free it, avoiding use of callout_async_drain(). There is one gotcha here: callout_stop() can actually touch our memory when a rare race condition happens. See comment above tcp_timer_stop(). Synchronous stop of the callout makes tcp_discardcb() the single entry point for tcpcb destructor, merging the tcp_freecb() to the end of the function. While here, also remove lots of lingering checks in the beginning of TCP timer functions. With a locked callout they are unnecessary. While here, clean unused parts of timer KPI for the pluggable TCP stacks. While here, remove TCPDEBUG from tcp_timer.c, as this allows for more simplification of TCP timers. The TCPDEBUG is scheduled for removal. Move the DTrace probes in timers to the beginning of a function, where a tcpcb is always existing. Discussed with: rrs, tuexen, rscheff (the TCP part of the diff) Reviewed by: hselasky, kib, mav (the callout part) Differential revision: https://reviews.freebsd.org/D37321
|
#
918fa422 |
|
07-Dec-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove tcp_timer_suspend() It was a temporary code added together with RACK to fight against TCP timer races.
|
#
bd4f9866 |
|
16-Nov-2022 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: remove unused t_rttbest No functional change intended. Reviewed by: rscheff@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D37401
|
#
9eb0e832 |
|
08-Nov-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: provide macros to access inpcb and socket from a tcpcb There should be no functional changes with this commit. Reviewed by: rscheff Differential revision: https://reviews.freebsd.org/D37123
|
#
bcf8fb7f |
|
08-Nov-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
tcp_bbr(4): Fix a typo in a source code comment - s/retranmitted/retransmitted/ MFC after: 3 days
|
#
f504685a |
|
31-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
rack/bbr: put back assertion that connection is not in TIME-WAIT The assertion was incorrectly removed in 0d7445193ab. The leak of a TIME-WAIT state into tfb_do_segment_nounlock method was fixed in 31bc602ff81. The TIME-WAIT connections are processed by the main tcp_input() always.
|
#
83c1ec92 |
|
20-Oct-2022 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
tcp: ECN preparations for ECN++, AccECN (tcp_respond) tcp_respond is another function to build a tcp control packet quickly. With ECN++ and AccECN, both the IP ECN header, and the TCP ECN flags are supposed to reflect the correct state. Also ensure that on receiving multiple ECN SYN-ACKs, the responses triggered will reflect the latest state. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D36973
|
#
53af6903 |
|
06-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove INP_TIMEWAIT flag Mechanically cleanup INP_TIMEWAIT from the kernel sources. After 0d7445193ab, this commit shall not cause any functional changes. Note: this flag was very often checked together with INP_DROPPED. If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we will be able to remove most of this checks and turn them to assertions. Some of them can be turned into assertions right now, but that should be carefully done on a case by case basis. Differential revision: https://reviews.freebsd.org/D36400
|
#
0d744519 |
|
06-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove tcptw, the compressed timewait state structure The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no longer justify the complexity required to maintain it. For longer explanation please check out the email [1]. Surpisingly through almost 20 years the TCP stack functionality of handling the TIME_WAIT state with a normal tcpcb did not bitrot. The existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state, which is confirmed by the packetdrill tcp-testsuite [2]. This change just removes tcptw and leaves INP_TIMEWAIT. The flag will be removed in a separate commit. This makes it easier to review and possibly debug the changes. [1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html [2] https://github.com/freebsd-net/tcp-testsuite Differential revision: https://reviews.freebsd.org/D36398
|
#
2220b66f |
|
03-Oct-2022 |
Konstantin Belousov <kib@FreeBSD.org> |
Add mbuf_tstmp2timeval() Reviewed by: hselasky, jkim, rscheff Sponsored by: NVIDIA networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36870
|
#
493105c2 |
|
21-Sep-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: fix simultaneous open and refine e80062a2d43 - The soisconnected() call on transition from SYN_RCVD to ESTABLISHED is also necessary for a half-synchronized connection. Fix that just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED. - Provide a comment that explains at what conditions the call to soisconnected() is necessary. - Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN. - Extend the change to the BBR and RACK stacks. Note: the interaction between the accept_filter(9) and the socket layer is not fully consistent, yet. For most accept filters this call to soisconnected() will not move the connection from the incomplete queue to the complete. The move would happen only when the filter has received the desired data, and soisconnected() would be called once again from sorwakeup(). Ideally, we should mark socket as connected only there, and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the simultaneous open case. However, this doesn't yet work. Reviewed by: rscheff, tuexen, rrs Differential revision: https://reviews.freebsd.org/D36641
|
#
76248965 |
|
14-Aug-2022 |
Dimitry Andric <dim@FreeBSD.org> |
Fix unused variable warning in tcp_stacks's bbr.c With clang 15, the following -Werror warning is produced: sys/netinet/tcp_stacks/bbr.c:11925:11: error: variable 'rtr_cnt' set but not used [-Werror,-Wunused-but-set-variable] uint32_t rtr_cnt = 0; ^ The 'rtr_cnt' variable was in bbr.c when it was first added, but it appears to have been a debugging aid that has never been used, so remove it. MFC after: 3 days
|
#
aeb6948d |
|
08-Jul-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
bbr: check proper flag for connection had been closed An older version of D35663 slipped through final reviews. Submitted by: Peter Lei Fixes: 74703901d8bbc3bc7a29df648bc3c131c87393c2
|
#
74703901 |
|
04-Jul-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: use a TCP flag to check if connection has been close(2)d The flag SS_NOFDREF is a private flag of the socket layer. It also is supposed to be read with SOCK_LOCK(), which we don't own here. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D35663
|
#
4581cffb |
|
12-May-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: fix build, convert missed sbreserve_locked() calls Fixes: 4328318445ae
|
#
033718ab |
|
12-Apr-2022 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
tcp: Whitespace cleanup in brr and rack Whitespace cleanup (leading spaces to tabs) Nicefy function definitions with indentations No functional change Reviewed By: #transport, thj Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D30043
|
#
2dd0c2bc |
|
09-Apr-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
tcp_bbr(4): Fix a typo in a source code comment - s/possiblity/possibility/ MFC after: 3 days
|
#
66570901 |
|
09-Apr-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
tcp_bbr(4): Fix two typos in source code comments - s/postive/positive/ - s/postion/position/ MFC after: 3days
|
#
4d6883cb |
|
08-Apr-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
tcp_bbr(4): Fix a typo in a sysctl description and a comment - s/postive/positive/ MFC after: 5 days
|
#
75fdc440 |
|
07-Feb-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
extra_tcp_stacks: Fix two typos in source code comments - s/recusive/recursive/ MFC after: 3 days
|
#
1ebf4607 |
|
03-Feb-2022 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
tcp: Access all 12 TCP header flags via inline function In order to consistently provide access to all (including reserved) TCP header flag bits, use an accessor function tcp_get_flags and tcp_set_flags. Also expand any flag variable from uint8_t / char to uint16_t. Reviewed By: hselasky, tuexen, glebius, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34130
|
#
3b3c08c1 |
|
02-Feb-2022 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: cleanup functions related to socket option handling Consistently only pass the inp and the sopt around. Don't pass the so around, since in a upcoming commit tcp_ctloutput_set() will be called from a context different from setsockopt(). Also expect the inp to be locked when calling tcp_ctloutput_[gs]et(), this is also required for the upcoming use by tcpsso, a command line tool to set socket options. Reviewed by: glebius, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D34151
|
#
9e58cca3 |
|
26-Jan-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
extra_tcp_stacks: Fix two typos in source code comments - s/differnt/different/ MFC after; 3 days
|
#
b3df222e |
|
26-Jan-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
extra_tcp_stacks: Fix a few common typos TCP_BBR: - Fix a typo introducted in 1b90dfa5d2b0, which was reported by tuexen@ TCP_RACK: - Correct two sysctl descriptions: s/corret/correct/ tcp_bbr(4): Also fix s/measurment/measurement/ in the man page MFC after: 1 week
|
#
aac52f94 |
|
18-Jan-2022 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: Warning cleanup from new compiler. The clang compiler recently got an update that generates warnings of unused variables where they were set, and then never used. This revision goes through the tcp stack and cleans all of those up. Reviewed by: Michael Tuexen, Gleb Smirnoff Sponsored by: Netflix Inc. Differential Revision:
|
#
1b90dfa5 |
|
02-Jan-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
tcp_bbr(4): Fix a few typos in sysctl descriptions - s/measurment/measurement/ MFC after: 3 days
|
#
a370832b |
|
26-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove delayed drop KPI No longer needed after tcp_output() can ask caller to drop. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33371
|
#
f64dc2ab |
|
26-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: TCP output method can request tcp_drop The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection when they do output on it. The default stack never does this, thus existing framework expects tcp_output() always to return locked and valid tcpcb. Provide KPI extension to satisfy demands of advanced stacks. If the output method returns negative error code, it means that caller must call tcp_drop(). In tcp_var() provide three inline methods to call tcp_output(): - tcp_output() is a drop-in replacement for the default stack, so that default stack can continue using it internally without modifications. For advanced stacks it would perform tcp_drop() and unlock and report that with negative error code. - tcp_output_unlock() handles the negative code and always converts it to positive and always unlocks. - tcp_output_nodrop() just calls the method and leaves the responsibility to drop on the caller. Sweep over the advanced stacks and use new KPI instead of using HPTS delayed drop queue for that. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33370
|
#
40fa3e40 |
|
26-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: mechanically substitute call to tfb_tcp_output to new method. Made with sed(1) execution: sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/) sed: s/tp->t_fb->tfb_tcp_output\(tp\)/tcp_output(tp)/ s/to tfb_tcp_output\(\)/to tcp_output()/ Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33366
|
#
db0ac6de |
|
02-Dec-2021 |
Cy Schubert <cy@FreeBSD.org> |
Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816" This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b. A mismerge of a merge to catch up to main resulted in files being committed which should not have been.
|
#
f971e791 |
|
02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_hpts: rename input queue to drop queue and trim dead code The HPTS input queue is in reality used only for "delayed drops". When a TCP stack decides to drop a connection on the output path it can't do that due to locking protocol between main tcp_output() and stacks. So, rack/bbr utilize HPTS to drop the connection in a different context. In the past the queue could also process input packets in context of HPTS thread, but now no stack uses this, so remove this functionality. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33025
|
#
50f081ec |
|
02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_hpts: provide tcp_in_hpts(). It will hide some internal HPTS knowledge from the consumers. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33023
|
#
1dadeab3 |
|
30-Nov-2021 |
Gordon Bergling <gbe@FreeBSD.org> |
netinet: Fix a common typo in source code comments - s/segement/segment/ MFC after: 3 days
|
#
c28e39c3 |
|
03-Nov-2021 |
Gordon Bergling <gbe@FreeBSD.org> |
Fix a common typo in syctl descriptions - s/maxiumum/maximum/ MFC after: 3 days
|
#
f581a26e |
|
25-Oct-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP. Pass control for IP/IP6 level options from generic tcp_ctloutput_set() down to per-stack ctloutput. Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput(). Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
|
#
a36230f7 |
|
01-Oct-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: Make dsack stats available in netstat and also make sure its aware of TLP's. DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics on DSACKs however are very useful in figuring out how much bad retransmissions you are doing. This is further complicated, however, by stacks that do TLP. A TLP when discovering a lost ack in the reverse path will cause the generation of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well as the more traditional dsack-bytes and dsack-packets. These will now all display in netstat -p tcp -s. This also updates all stacks that are currently built to keep track of these stats. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32158
|
#
b1e806c0 |
|
07-Jul-2021 |
Andrew Gallatin <gallatin@FreeBSD.org> |
tcp: fix alternate stack build with LINT-NO{INET,INET6,IP} When fixing another bug, I noticed that the alternate TCP stacks do not build when various combinations of ipv4 and ipv6 are disabled. Reviewed by: rrs, tuexen Differential Revision: https://reviews.freebsd.org/D31094 Sponsored by: Netflix
|
#
d7955cc0 |
|
06-Jul-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: HPTS performance enhancements HPTS drives both rack and bbr, and yet there have been many complaints about performance. This bit of work restructures hpts to help reduce CPU overhead. It does this by now instead of relying on the timer/callout to drive it instead use user return from a system call as well as lro flushes to drive hpts. The timer becomes a backstop that dynamically adjusts based on how "late" we are. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31083
|
#
9e4d9e4c |
|
25-Jun-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software. Hardware TLS is now supported in some interface cards and it works well. Except that when we have connections that retransmit a lot we get into trouble with all the retransmits. This prep step makes way for change that Drew will be making so that we can "kick out" a session from hardware TLS. Reviewed by: mtuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30895
|
#
ba1b3e48 |
|
11-Jun-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: Missing mfree in rack and bbr Recently (Nov) we added logic that protects against a peer negotiating a timestamp, and then not including a timestamp. This involved in the input path doing a goto done_with_input label. Now I suspect the code was cribbed from one in Rack that has to do with the SYN. This had a bug, i.e. it should have a m_freem(m) before going to the label (bbr had this missing m_freem() but rack did not). This then caused the missing m_freem to show up in both BBR and Rack. Also looking at the code referencing m->m_pkthdr.lro_nsegs later (after processing) is not a good idea, even though its only for logging. Best to copy that off before any frees can take place. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30727
|
#
67e89281 |
|
10-Jun-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: Mbuf leak while holding a socket buffer lock. When running at NF the current Rack and BBR changes with the recent commits from Richard that cause the socket buffer lock to be held over the ip_output() call and then finally culminating in a call to tcp_handle_wakeup() we get a lot of leaked mbufs. I don't think that this leak is actually caused by holding the lock or what Richard has done, but is exposing some other bug that has probably been lying dormant for a long time. I will continue to look (using his changes) at what is going on to try to root cause out the issue. In the meantime I can't leave the leaks out for everyone else. So this commit will revert all of Richards changes and move both Rack and BBR back to just doing the old sorwakeup_locked() calls after messing with the so_rcv buffer. We may want to look at adding back in Richards changes after I have pinpointed the root cause of the mbuf leak and fixed it. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30704
|
#
4747500d |
|
04-Jun-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: A better fix for the previously attempted fix of the ack-war issue with tcp. So it turns out that my fix before was not correct. It ended with us failing some of the "improved" SYN tests, since we are not in the correct states. With more digging I have figured out the root of the problem is that when we receive a SYN|FIN the reassembly code made it so we create a segq entry to hold the FIN. In the established state where we were not in order this would be correct i.e. a 0 len with a FIN would need to be accepted. But if you are in a front state we need to strip the FIN so we correctly handle the ACK but ignore the FIN. This gets us into the proper states and avoids the previous ack war. I back out some of the previous changes but then add a new change here in tcp_reass() that fixes the root cause of the issue. We still leave the rack panic fixes in place however. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30627
|
#
8c69d988 |
|
27-May-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: When we have an out-of-order FIN we do want to strip off the FIN bit. The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr). However there was a missing case, i.e. where we get an out-of-order FIN by itself. In such a case we don't want to leave the FIN bit set, otherwise we will do the wrong thing and ack the FIN incorrectly. Instead we need to go through the tcp_reasm() code and that way the FIN will be stripped and all will be well. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30497
|
#
13c0e198 |
|
25-May-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: Fix bugs related to the PUSH bit and rack and an ack war Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed in the last commit. The problem is the left edge gets transmitted before the adjustments are done to the send_map, this means that right edge bits must be considered to be added only if the entire RSM is being retransmitted. Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself. After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically what happens is we go into the reassembly code and lose the FIN bit. The trick here is we should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything. That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no timers running. This is because the usrclosed function gets called and the FIN's and such have already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2 timer can speed this up in testing. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30451
|
#
8923ce63 |
|
22-May-2021 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: Handle stack switch while processing socket options Handle the case where during socket option processing, the user switches a stack such that processing the stack specific socket option does not make sense anymore. Return an error in this case. MFC after: 1 week Reviewed by: markj Reported by: syzbot+a6e1d91f240ad5d72cd1@syzkaller.appspotmail.com Sponsored by: Netflix, Inc. Differential revision: https://reviews.freebsd.org/D30395
|
#
032bf749 |
|
21-May-2021 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
[tcp] Keep socket buffer locked until upcall r367492 would unlock the socket buffer before eventually calling the upcall. This leads to problematic interaction with NFS kernel server/client components (MP threads) accessing the socket buffer with potentially not correctly updated state. Reported by: rmacklem Reviewed By: tuexen, #transport Tested by: rmacklem, otis MFC after: 2 weeks Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29690
|
#
500eb6dd |
|
21-May-2021 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: Fix sending of TCP segments with IP level options When bringing in TCP over UDP support in https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605, the length of IP level options was considered when locating the transport header. This was incorrect and is fixed by this patch. X-MFC with: https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605 MFC after: 3 days Reviewed by: markj, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30358
|
#
5d8fd932 |
|
06-May-2021 |
Randall Stewart <rrs@FreeBSD.org> |
This brings into sync FreeBSD with the netflix versions of rack and bbr. This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents comming soon on rack.. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036
|
#
9e644c23 |
|
18-Apr-2021 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: add support for TCP over UDP Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special priviledges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469
|
#
40f41ece |
|
22-Mar-2021 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: improve handling of SYN segments in SYN-SENT state Ensure that the stack does not generate a DSACK block for user data received on a SYN segment in SYN-SENT state. Reviewed by: rscheff MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D29376 Sponsored by: Netflix, Inc.
|
#
5666643a |
|
13-Mar-2021 |
Gordon Bergling <gbe@FreeBSD.org> |
Fix some common typos in comments - occured -> occurred - normaly -> normally - controling -> controlling - fileds -> fields - insterted -> inserted - outputing -> outputting MFC after: 1 week
|
#
1a714ff2 |
|
26-Jan-2021 |
Randall Stewart <rrs@FreeBSD.org> |
This pulls over all the changes that are in the netflix tree that fix the ratelimit code. There were several bugs in tcp_ratelimit itself and we needed further work to support the multiple tag format coming for the joint TLS and Ratelimit dances. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28357
|
#
d2b3cedd |
|
13-Jan-2021 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: add sysctl to tolerate TCP segments missing timestamps When timestamp support has been negotiated, TCP segements received without a timestamp should be discarded. However, there are broken TCP implementations (for example, stacks used by Omniswitch 63xx and 64xx models), which send TCP segments without timestamps although they negotiated timestamp support. This patch adds a sysctl variable which tolerates such TCP segments and allows to interoperate with broken stacks. Reviewed by: jtl@, rscheff@ Differential Revision: https://reviews.freebsd.org/D28142 Sponsored by: Netflix, Inc. PR: 252449 MFC after: 1 week
|
#
cc3c3485 |
|
13-Jan-2021 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: fix handling of TCP RST segments missing timestamps A TCP RST segment should be processed even it is missing TCP timestamps. Reported by: dmgk@, kevans@ Reviewed by: rscheff@, dmgk@ Sponsored by: Netflix, Inc. MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28143
|
#
283c76c7 |
|
09-Nov-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
RFC 7323 specifies that: * TCP segments without timestamps should be dropped when support for the timestamp option has been negotiated. * TCP segments with timestamps should be processed normally if support for the timestamp option has not been negotiated. This patch enforces the above. PR: 250499 Reviewed by: gnn, rrs MFC after: 1 week Sponsored by: Netflix, Inc Differential Revision: https://reviews.freebsd.org/D27148
|
#
4d0770f1 |
|
08-Nov-2020 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
Prevent premature SACK block transmission during loss recovery Under specific conditions, a window update can be sent with outdated SACK information. Some clients react to this by subsequently delaying loss recovery, making TCP perform very poorly. Reported by: chengc_netapp.com Reviewed by: rrs, jtl MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D24237
|
#
285385ba |
|
09-Sep-2020 |
Randall Stewart <rrs@FreeBSD.org> |
So it turns out that syzkaller hit another crash. It has to do with switching stacks with a SENT_FIN outstanding. Both rack and bbr will only send a FIN if all data is ack'd so this must be enforced. Also if the previous stack sent the FIN we need to make sure in rack that when we manufacture the "unknown" sends that we include the proper HAS_FIN bits. Note for BBR we take a simpler approach and just refuse to switch. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D26269
|
#
67d224ef |
|
04-Sep-2020 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
bbr: remove unused static function bbr_log_type_hrdwtso() is a file local static unused function. Remove it to avoid warnings on kernel compiles. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D26331
|
#
662c1305 |
|
01-Sep-2020 |
Mateusz Guzik <mjg@FreeBSD.org> |
net: clean up empty lines in .c and .h files
|
#
b9978183 |
|
19-Aug-2020 |
Andrew Gallatin <gallatin@FreeBSD.org> |
TCP: remove special treatment for hardware (ifnet) TLS Remove most special treatment for ifnet TLS in the TCP stack, except for code to avoid mixing handshakes and bulk data. This code made heroic efforts to send down entire TLS records to NICs. It was added to improve the PCIe bus efficiency of older TLS offload NICs which did not keep state per-session, and so would need to re-DMA the first part(s) of a TLS record if a TLS record was sent in multiple TCP packets or TSOs. Newer TLS offload NICs do not need this feature. At Netflix, we've run extensive QoE tests which show that this feature reduces client quality metrics, presumably because the effort to send TLS records atomically causes the server to both wait too long to send data (leading to buffers running dry), and to send too much data at once (leading to packet loss). Reviewed by: hselasky, jhb, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26103
|
#
e54b7cd0 |
|
01-Jul-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
Fix the cleanup handling in a error path for TCP BBR. Reported by: syzbot+df7899c55c4cc52f5447@syzkaller.appspotmail.com Reviewed by: rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D25486
|
#
95ef69c6 |
|
16-Jun-2020 |
Randall Stewart <rrs@FreeBSD.org> |
iSo in doing final checks on OCA firmware with all the latest tweaks the dup-ack checking packet drill script was failing with a number of unexpected acks. So it turns out if you have the default recvwin set up to 1Meg (like OCA's do) and you have no window scaling (like the dupack checking code) then we have another case where we are always trying to update the rwnd and sending an ack when we should not. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D25298
|
#
f092a3c7 |
|
12-Jun-2020 |
Randall Stewart <rrs@FreeBSD.org> |
So it turns out with the right window scaling you can get the code in all stacks to always want to do a window update, even when no data can be sent. Now in cases where you are not pacing thats probably ok, you just send an extra window update or two. However with bbr (and rack if its paced) every time the pacer goes off its going to send a "window update". Also in testing bbr I have found that if we are not responding to data right away we end up staying in startup but incorrectly holding a pacing gain of 192 (a loss). This is because the idle window code does not restict itself to only work with PROBE_BW. In all other states you dont want it doing a PROBE_BW state change. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D25247
|
#
e854dd38 |
|
08-Jun-2020 |
Randall Stewart <rrs@FreeBSD.org> |
An important statistic in determining if a server process (or client) is being delayed is to know the time to first byte in and time to first byte out. Currently we have no way to know these all we have is t_starttime. That (t_starttime) tells us what time the 3 way handshake completed. We don't know when the first request came in or how quickly we responded. Nor from a client perspective do we know how long from when we sent out the first byte before the server responded. This small change adds the ability to track the TTFB's. This will show up in BB logging which then can be pulled for later analysis. Note that currently the tracking is via the ticks variable of all three variables. This provides a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be to change all three of these values into something with a much finer resolution (either microseconds or nanoseconds), though we may want to make the resolution configurable so that on lower powered machines we could still use the much cheaper ticks variable. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24902
|
#
f1ea4e41 |
|
03-Jun-2020 |
Randall Stewart <rrs@FreeBSD.org> |
This fixes a couple of skyzaller crashes. Most of them have to do with TFO. Even the default stack had one of the issues: 1) We need to make sure for rack that we don't advance snd_nxt beyond iss when we are not doing fast open. We otherwise can get a bunch of SYN's sent out incorrectly with the seq number advancing. 2) When we complete the 3-way handshake we should not ever append to reassembly if the tlen is 0, if TFO is enabled prior to this fix we could still call the reasemmbly. Note this effects all three stacks. 3) Rack like its cousin BBR should track if a SYN is on a send map entry. 4) Both bbr and rack need to only consider len incremented on a SYN if the starting seq is iss, otherwise we don't increment len which may mean we return without adding a sendmap entry. This work was done in collaberation with Michael Tuexen, thanks for all the testing! Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D25000
|
#
77c68315 |
|
23-May-2020 |
Emmanuel Vadot <manu@FreeBSD.org> |
bbr: Use arc4random_uniform from libkern. This unbreak LINT build Reported by: jenkins, melifaro
|
#
8e051165 |
|
21-May-2020 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
Retain only mutually supported TCP options after simultaneous SYN When receiving a parallel SYN in SYN-SENT state, remove all the options only we supported locally before sending the SYN,ACK. This addresses a consistency issue on parallel opens. Also, on such a parallel open, the stack could be coaxed into running with timestamps enabled, even if administratively disabled. Reviewed by: tuexen (mentor) Approved by: tuexen (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23371
|
#
777b88d6 |
|
15-May-2020 |
Randall Stewart <rrs@FreeBSD.org> |
This fixes several skyzaller issues found with the help of Michael Tuexen. There was some accounting errors with TCPFO for bbr and also for both rack and bbr there was a FO case where we should be jumping to the just_return_nolock label to exit instead of returning 0. This of course caused no timer to be running and thus the stuck sessions. Reported by: Michael Tuexen and Skyzaller Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24852
|
#
b1ddcbc6 |
|
07-May-2020 |
Randall Stewart <rrs@FreeBSD.org> |
When in the SYN-SENT state bbr and rack will not properly send an ACK but instead start the D-ACK timer. This causes so_reuseport_lb_test to fail since it slows down how quickly the program runs until the timeout occurs and fails the test Sponsored by: Netflix inc. Differential Revision: https://reviews.freebsd.org/D24747
|
#
8717b8f1 |
|
07-May-2020 |
Randall Stewart <rrs@FreeBSD.org> |
NF has an internal option that changes the tcp_mcopy_m routine slightly (has a few extra arguments). Recently that changed to only have one arg extra so that two ifdefs around the call are no longer needed. Lets take out the extra ifdef and arg. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D24736
|
#
570045a0 |
|
04-May-2020 |
Randall Stewart <rrs@FreeBSD.org> |
This fixes two issues found by ankitraheja09@gmail.com 1) When BBR retransmits the syn it was messing up the snd_max 2) When we need to send a RST we might not send it when we should Reported by: ankitraheja09@gmail.com Sponsored by: Netflix.com Differential Revision: https://reviews.freebsd.org/D24693
|
#
7985fd7e |
|
04-May-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
Enter the net epoch before calling the output routine in TCP BBR. This was only triggered when setting the IPPROTO_TCP level socket option TCP_DELACK. This issue was found by runnning an instance of SYZKALLER. Reviewed by: rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24690
|
#
963fb2ad |
|
04-May-2020 |
Randall Stewart <rrs@FreeBSD.org> |
This commit brings things into sync with the advancements that have been made in rack and adds a few fixes in BBR. This also removes any possibility of incorrectly doing OOB data the stacks do not support it. Should fix the skyzaller crashes seen in the past. Still to fix is the BBR issue just reported this weekend with the SYN and on sending a RST. Note that this version of rack can now do pacing as well. Sponsored by:Netflix Inc Differential Revision:https://reviews.freebsd.org/D24576
|
#
9028b6e0 |
|
29-Apr-2020 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
Prevent premature shrinking of the scaled receive window which can cause a TCP client to use invalid or stale TCP sequence numbers for ACK packets. Packets with old sequence numbers are ignored and not used to update the send window size. This might cause the TCP session to hang indefinitely under some circumstances. Reported by: Cui Cheng Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D24515
|
#
b2ade6b1 |
|
29-Apr-2020 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
Correctly set up the initial TCP congestion window in all cases, by not including the SYN bit sequence space in cwnd related calculations. Snd_und is adjusted explicitly in all cases, outside the cwnd update, instead. This fixes an off-by-one conformance issue with regular TCP sessions not using Appropriate Byte Counting (RFC3465), sending one more packet during the initial window than expected. PR: 235256 Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D19000
|
#
ac99fd86 |
|
25-Apr-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Fix LINT build broken by r360292.
|
#
983066f0 |
|
25-Apr-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Convert route caching to nexthop caching. This change is build on top of nexthop objects introduced in r359823. Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with outher parts of the stack and allows to transparently introduce faster lookup algorithms. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nextop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remains the same. Differential Revision: https://reviews.freebsd.org/D24340
|
#
bb410f9f |
|
21-Apr-2020 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
revert rS360143 - Correctly set up initial cwnd due to syzkaller panics found Reported by: tuexen Approved by: tuexen (mentor) Sponsored by: NetApp, Inc.
|
#
73b76966 |
|
21-Apr-2020 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
Correctly set up the initial TCP congestion window in all cases, by adjust snd_una right after the connection initialization, to include the one byte in sequence space occupied by the SYN bit. This does not change the regular ACK processing, while making the BYTES_THIS_ACK macro to work properly. PR: 235256 Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D19000
|
#
413c3db1 |
|
31-Mar-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
Allow the TCP backhole detection to be disabled at all, enabled only for IPv4, enabled only for IPv6, and enabled for IPv4 and IPv6. The current blackhole detection might classify a temporary outage as an MTU issue and reduces permanently the MSS. Since the consequences of such a reduction due to a misclassification are much more drastically for IPv4 than for IPv6, allow the administrator to enable it for IPv6 only. Reviewed by: bcr@ (man page), Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24219
|
#
7ca6e296 |
|
12-Mar-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that instead of TCPSTAT_ADD. Reviewed by: jtl@, rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D23904
|
#
7029da5c |
|
26-Feb-2020 |
Pawel Biernacki <kaktus@FreeBSD.org> |
Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
|
#
3fba40d9 |
|
11-Feb-2020 |
Randall Stewart <rrs@FreeBSD.org> |
Remove all trailing white space from the BBR/Rack fold. Bits left around by emacs (thanks emacs).
|
#
5aa0576b |
|
07-Feb-2020 |
Ed Maste <emaste@FreeBSD.org> |
Miscellaneous typo fixes Submitted by: Gordon Bergling <gbergling_gmail.com> Differential Revision: https://reviews.freebsd.org/D23453
|
#
109eb549 |
|
21-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Make tcp_output() require network epoch. Enter the epoch before calling into tcp_output() from those functions, that didn't do that before. This eliminates a bunch of epoch recursions in TCP.
|
#
b9555453 |
|
21-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Make ip6_output() and ip_output() require network epoch. All callers that before may called into these functions without network epoch now must enter it.
|
#
334fc582 |
|
08-Jan-2020 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
vnet: virtualise more network stack sysctls. Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointles at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks
|
#
1cf55767 |
|
17-Dec-2019 |
Randall Stewart <rrs@FreeBSD.org> |
This commit is a bit of a re-arrange of deck chairs. It gets both rack and bbr ready for the completion of the STATs framework in FreeBSD. For now if you don't have both NF_stats and stats on it disables them. As soon as the rest of the stats framework lands we can remove that restriction and then just uses stats when defined. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D22479
|
#
d40c0d47 |
|
07-Nov-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Now that all of the tcp_input() and all its branches are executed in the network epoch, we can greatly simplify synchronization. Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro. Remove some unneccesary assertions and convert necessary ones into the NET_EPOCH_ASSERT macro.
|
#
9992c365 |
|
23-Oct-2019 |
Randall Stewart <rrs@FreeBSD.org> |
Fix a small bug in bbr when running under a VM. Basically what happens is we are more delayed in the pacer calling in so we remove the stack from the pacer and recalculate how much time is left after all data has been acknowledged. However the comparision was backwards so we end up with a negative value in the last_pacing_delay time which causes us to add in a huge value to the next pacing time thus stalling the connection. Reported by: vm2.finance@gmail.com
|
#
12a43d0d |
|
29-Sep-2019 |
Michael Tuexen <tuexen@FreeBSD.org> |
RFC 7112 requires a host to put the complete IP header chain including the TCP header in the first IP packet. Enforce this in tcp_output(). In addition make sure that at least one byte payload fits in the TCP segement to allow making progress. Without this check, a kernel with INVARIANTS will panic. This issue was found by running an instance of syzkaller. Reviewed by: jtl@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21665
|
#
ac7bd23a |
|
24-Sep-2019 |
Randall Stewart <rrs@FreeBSD.org> |
lets put (void) in a couple of functions to keep older platforms that are stuck with gcc happy (ppc). The changes are needed in both bbr and rack. Obtained from: Michael Tuexen (mtuexen@)
|
#
da99b33b |
|
24-Sep-2019 |
Randall Stewart <rrs@FreeBSD.org> |
don't call in_ratelmit detach when RATELIMIT is not compiled in the kernel.
|
#
35c7bb34 |
|
24-Sep-2019 |
Randall Stewart <rrs@FreeBSD.org> |
This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value has time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths(when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582
|