Cross Reference: /freebsd-10.3-release/sys/netinet/tcp

History log of /freebsd-10.3-release/sys/netinet/tcp_reass.c
Revision	Date	Author	Comments (<<< Hide modified files) (Show modified files >>>)
# 296373	04-Mar-2016	marius	- Copy stable/10@296371 to releng/10.3 in preparation for 10.3-RC1 builds. - Update newvers.sh to reflect RC1. - Update __FreeBSD_version to reflect 10.3. - Update default pkg(8) configuration to use the quarterly branch. Approved by: re (implicit) /freebsd-10.3-release/etc/pkg/FreeBSD.conf /freebsd-10.3-release/release/pkg_repos/release-dvd.conf /freebsd-10.3-release/sys/conf/newvers.sh /freebsd-10.3-release/sys/gnu/dts /freebsd-10.3-release/sys/sys/param.h
# 285976	28-Jul-2015	delphij	Fix patch(1) shell injection vulnerability. [SA-15:14] Fix resource exhaustion in TCP reassembly. [SA-15:15] Fix OpenSSH multiple vulnerabilities. [SA-15:16]
# 273736	27-Oct-2014	hselasky	MFC r263710, r273377, r273378, r273423 and r273455: - De-vnet hash sizes and hash masks. - Fix multiple issues related to arguments passed to SYSCTL macros. Sponsored by: Mellanox Technologies
# 265122	30-Apr-2014	delphij	Fix devfs rules not applied by default for jails. Fix OpenSSL use-after-free vulnerability. Fix TCP reassembly vulnerability. Security: FreeBSD-SA-14:07.devfs Security: CVE-2014-3001 Security: FreeBSD-SA-14:08.tcp Security: CVE-2014-3000 Security: FreeBSD-SA-14:09.openssl Security: CVE-2010-5298
# 256281	10-Oct-2013	gjb	Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
# 246208	01-Feb-2013	andre	uma_zone_set_max() directly returns the rounded effective zone limit. Use the return value directly instead of doing a second uma_zone_set_max() step. MFC after: 1 week
# 244680	25-Dec-2012	glebius	Fix sysctl_handle_int() usage. Either arg1 or arg2 should be supplied, and arg2 doesn't pass size of arg1.
# 242253	28-Oct-2012	andre	Simplify implementation of net.inet.tcp.reass.maxsegments and net.inet.tcp.reass.cursegments. MFC after: 2 weeks
# 228016	27-Nov-2011	lstewart	Plug a TCP reassembly UMA zone leak introduced in r226113 by only using the backup stack queue entry when the zone is exhausted, otherwise we leak a zone allocation each time we plug a hole in the reassembly queue. Reported by: many on freebsd-stable@ (thread: "TCP Reassembly Issues") Tested by: many on freebsd-stable@ (thread: "TCP Reassembly Issues") Reviewed by: bz (very brief sanity check) MFC after: 3 days
# 227309	07-Nov-2011	ed	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
# 226113	07-Oct-2011	andre	Prevent TCP sessions from stalling indefinitely in reassembly when reaching the zone limit of reassembly queue entries. When the zone limit was reached not even the missing segment that would complete the sequence space could be processed preventing the TCP session forever from making any further progress. Solve this deadlock by using a temporary on-stack queue entry for the missing segment followed by an immediate dequeue again by delivering the contiguous sequence space to the socket. Add logging under net.inet.tcp.log_debug for reassembly queue issues. Reviewed by: lsteward (previous version) Tested by: Steven Hartland <killing-at-multiplay.co.uk> MFC after: 3 days
# 217554	18-Jan-2011	mdf	Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need to rely on the format string. For SYSCTL_PROC instances that I noticed a discrepancy between the CTLTYPE and the format specifier, fix the CTLTYPE.
# 217126	07-Jan-2011	jhb	Trim extra spaces before tabs.
# 215701	22-Nov-2010	dim	After some off-list discussion, revert a number of changes to the DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless. Changes reverted: ------------------------------------------------------------------------ r215318 \| dim \| 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) \| 4 lines Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined. ------------------------------------------------------------------------ r215317 \| dim \| 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) \| 3 lines Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree. ------------------------------------------------------------------------ r215316 \| dim \| 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) \| 2 lines Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
# 215434	17-Nov-2010	gnn	Add new, per connection, statistics for TCP, including: Retransmitted Packets Zero Window Advertisements Out of Order Receives These statistics are available via the -T argument to netstat(1). MFC after: 2 weeks
# 215317	14-Nov-2010	dim	Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
# 213913	16-Oct-2010	lstewart	Retire the system-wide, per-reassembly queue segment limit. The mechanism is far too coarse grained to be useful and the default value significantly degrades TCP performance on moderate to high bandwidth-delay product paths with non-zero loss (e.g. 5+Mbps connections across the public Internet often suffer). Replace the outgoing mechanism with an individual per-queue limit based on the number of MSS segments that fit into the socket's receive buffer. This should strike a good balance between performance and the potential for resource exhaustion when FreeBSD is acting as a TCP receiver. With socket buffer autotuning (which is enabled by default), the reassembly queue tracks the socket buffer and benefits too. As the XXX comment suggests, my testing uncovered some unexpected behaviour which requires further investigation. By using so->so_rcv.sb_hiwat instead of sbspace(&so->so_rcv), we allow more segments to be held across both the socket receive buffer and reassembly queue than we probably should. The tradeoff is better performance in at least one common scenario, versus a devious sender's ability to consume more resources on a FreeBSD receiver. Sponsored by: FreeBSD Foundation Reviewed by: andre, gnn, rpaulo MFC after: 2 weeks
# 213912	16-Oct-2010	lstewart	- Switch the "net.inet.tcp.reass.cursegments" and "net.inet.tcp.reass.maxsegments" sysctl variables to be based on UMA zone stats. The value returned by the cursegments sysctl is approximate owing to the way in which uma_zone_get_cur is implemented. - Discontinue use of V_tcp_reass_qsize as a global reassembly segment count variable in the reassembly implementation. The variable was used without proper synchronisation and was duplicating accounting done by UMA already. The lack of synchronisation was particularly problematic on SMP systems terminating many TCP sessions, resulting in poor TCP performance for connections with non-zero packet loss. Sponsored by: FreeBSD Foundation Reviewed by: andre, gnn, rpaulo (as part of a larger patch) MFC after: 2 weeks
# 213158	25-Sep-2010	lstewart	Internalise reassembly queue related functionality and variables which should not be used outside of the reassembly queue implementation. Provide a new function to flush all segments from a reassembly queue and call it from the appropriate places instead of manipulating the queue directly. Sponsored by: FreeBSD Foundation Reviewed by: andre, gnn, rpaulo MFC after: 2 weeks
# 207369	29-Apr-2010	bz	MFP4: @176978-176982, 176984, 176990-176994, 177441 "Whitspace" churn after the VIMAGE/VNET whirls. Remove the need for some "init" functions within the network stack, like pim6_init(), icmp_init() or significantly shorten others like ip6_init() and nd6_init(), using static initialization again where possible and formerly missed. Move (most) variables back to the place they used to be before the container structs and VIMAGE_GLOABLS (before r185088) and try to reduce the diff to stable/7 and earlier as good as possible, to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9. This also removes some header file pollution for putatively static global variables. Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are no longer needed. Reviewed by: jhb Discussed with: rwatson Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH MFC after: 6 days
# 204838	07-Mar-2010	bz	Destroy TCP UMA zones (empty or not) upon network stack teardown to not leak them, otherwise making UMA/vmstat unhappy with every stoped vnet. We will still leak pages (especially for zones marked NOFREE). Reshuffle cleanup order in tcp_destroy() to get rid of what we can easily free first. Sponsored by: ISPsystem Reviewed by: rwatson MFC after: 5 days
# 196019	01-Aug-2009	rwatson	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# 195727	16-Jul-2009	rwatson	Remove unused VNET_SET() and related macros; only VNET_GET() is ever actually used. Rename VNET_GET() to VNET() to shorten variable references. Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib)
# 195699	14-Jul-2009	rwatson	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing the in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of a the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING. Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
# 192761	25-May-2009	rwatson	Remove comment about moving tcp_reass() to its own file named tcp_reass.c, that happened a while ago. MFC after: 3 days
# 190948	11-Apr-2009	rwatson	Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and TCPSTAT_INC(), rather than directly manipulating the fields across the kernel. This will make it easier to change the implementation of these statistics, such as using per-CPU versions of the data structures. MFC after: 3 days
# 190787	06-Apr-2009	zec	First pass at separating per-vnet initializer functions from existing functions for initializing global state. At this stage, the new per-vnet initializer functions are directly called from the existing global initialization code, which should in most cases result in compiler inlining those new functions, hence yielding a near-zero functional change. Modify the existing initializer functions which are invoked via protosw, like ip_init() et. al., to allow them to be invoked multiple times, i.e. per each vnet. Global state, if any, is initialized only if such functions are called within the context of vnet0, which will be determined via the IS_DEFAULT_VNET(curvnet) check (currently always true). While here, V_irtualize a few remaining global UMA zones used by net/netinet/netipsec networking code. While it is not yet clear to me or anybody else whether this is the right thing to do, at this stage this makes the code more readable, and makes it easier to track uncollected UMA-zone-backed objects on vnet removal. In the long run, it's quite possible that some form of shared use of UMA zone pools among multiple vnets should be considered. Bump __FreeBSD_version due to changes in layout of structs vnet_ipfw, vnet_inet and vnet_net. Approved by: julian (mentor)
# 185571	02-Dec-2008	bz	Rather than using hidden includes (with cicular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
# 185088	19-Nov-2008	zec	Change the initialization methodology for global variables scheduled for virtualization. Instead of initializing the affected global variables at instatiation, assign initial values to them in initializer functions. As a rule, initialization at instatiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentialy, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to intialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 183550	02-Oct-2008	zec	Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparision between pre- and post-change object files(). () netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 181803	17-Aug-2008	bz	Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
# 178285	17-Apr-2008	rwatson	Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to explicitly select write locking for all use of the inpcb mutex. Update some pcbinfo lock assertions to assert locked rather than write-locked, although in practice almost all uses of the pcbinfo rwlock main exclusive, and all instances of inpcb lock acquisition are exclusive. This change should introduce (ideally) little functional change. However, it lays the groundwork for significantly increased parallelism in the TCP/IP code. MFC after: 3 months Tested by: kris (superset of committered patch)
# 172467	07-Oct-2007	silby	Add FBSDID to all files in netinet so that people can more easily include file version information in bug reports. Approved by: re (kensmith)
# 169541	13-May-2007	andre	Complete the (mechanical) move of the TCP reassembly and timewait functions from their origininal place to their own files. TCP Reassembly from tcp_input.c -> tcp_reass.c TCP Timewait from tcp_subr.c -> tcp_timewait.c
# 169481	11-May-2007	andre	Drop everything that doesn't belong into this new file. It's neither functional nor connected to the build yet.
# 169454	10-May-2007	rwatson	Move universally to ANSI C function declarations, with relatively consistent style(9)-ish layout.
# 169417	09-May-2007	maxim	o Fix style(9) bugs introduced in the last commit. Pointed out by: bde
# 169405	09-May-2007	maxim	o Unbreak "options TCPDEBUG" && "nooptions INET6" kernel build. PR: kern/112517 Submitted by: vd
# 169317	06-May-2007	andre	Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of a decdicated sack_enable int for this bool. Change all users accordingly.
# 169316	06-May-2007	andre	o Remove redundant tcp reassembly check in header prediction code o Rearrange code to make intent in TCPS_SYN_SENT case more clear o Assorted style cleanup o Comment clarification for tcp_dropwithreset()
# 169315	06-May-2007	andre	Reorder the TCP header prediction test to check for the most volatile values first to spend less time on a fallback to normal processing.
# 169314	06-May-2007	andre	Remove the defunct remains of the TCPS_TIME_WAIT cases from tcp_do_segment and change it to a void function. We use a compressed structure for TCPS_TIME_WAIT to save memory. Any late late segments arriving for such a connection is handled directly in the TW code.
# 169268	04-May-2007	rwatson	Tweak comment at end of tcp_input() when calling into tcp_do_segment(): the pcbinfo lock will be released as well, not just the pcb lock.
# 168986	23-Apr-2007	andre	o Fix INP lock leak in the minttl case o Remove indirection in the decision of unlocking inp o Further annotation of locking in tcp_input()
# 168906	20-Apr-2007	andre	o Remove unncessary TOF_SIGLEN flag from struct tcpopt o Correctly set to->to_signature in tcp_dooptions() o Update comments
# 168905	20-Apr-2007	andre	Add more KASSERT's.
# 168903	20-Apr-2007	andre	Remove bogus check for accept queue length and associated failure handling from the incoming SYN handling section of tcp_input(). Enforcement of the accept queue limits is done by sonewconn() after the 3WHS is completed. It is not necessary to have an earlier check before a connection request enters the SYN cache awaiting the full handshake. It rather limits the effectiveness of the syncache by preventing legit and illegit connections from entering it and having them shaken out before we hit the real limit which may have vanished by then. Change return value of syncache_add() to void. No status communication is required.
# 168902	20-Apr-2007	andre	Simplifly syncache_expand() and clarify its semantics. Zero is returned when the ACK is invalid and doesn't belong to any registered connection, either in syncache or through SYN cookies. True but a NULL struct socket is returned when the 3WHS completed but the socket could not be created due to insufficient resources or limits reached. For both cases an RST is sent back in tcp_input(). A logic error leading to a panic is fixed where syncache_expand() would free the mbuf on socket allocation failure but tcp_input() later supplies it to tcp_dropwithreset() to issue a RST to the peer. Reported by: kris (the panic)
# 168769	15-Apr-2007	rwatson	Remove unused variable tcbinfo_mtx.
# 168615	11-Apr-2007	andre	Change the TCP timer system from using the callout system five times directly to a merged model where only one callout, the next to fire, is registered. Instead of callout_reset(9) and callout_stop(9) the new function tcp_timer_activate() is used which then internally manages the callout. The single new callout is a mutex callout on inpcb simplifying the locking a bit. tcp_timer() is the called function which handles all race conditions in one place and then dispatches the individual timer functions. Reviewed by: rwatson (earlier version)
# 168369	04-Apr-2007	andre	Add INP_INFO_UNLOCK_ASSERT() and use it in tcp_input(). Also add some further INP_INFO_WLOCK_ASSERT() while there.
# 168368	04-Apr-2007	andre	Move last tcpcb initialization for the inbound connection case from tcp_input() to syncache_socket() where it belongs and the majority of it already happens. The "tp->snd_up = tp->snd_una" is removed as it is done with the tcp_sendseqinit() macro a few lines earlier.
# 168364	04-Apr-2007	andre	Retire unused TCP_SACK_DEBUG.
# 168363	04-Apr-2007	andre	In tcp_dooptions() skip over SACK options if it is a SYN segment.
# 167989	28-Mar-2007	andre	When blackholing do a 'dropunlock' in the new world order to prevent the INP_INFO_LOCK from leaking. Reported by: ache Found by: rwatson
# 167873	24-Mar-2007	maxim	o Use a define for a buffer size. Prodded by: db o Add missed vars for TCPDEBUG in tcp_do_segment(). Prodded by: tinderbox
# 167839	23-Mar-2007	andre	Split tcp_input() into its two functional parts: o tcp_input() now handles TCP segment sanity checks and preparations including the INPCB lookup and syncache. o tcp_do_segment() handles all data and ACK processing and is IPv4/v6 agnostic. Change all KASSERT() messages to ("%s: ", __func__). The changes in this commit are primarily of mechanical nature and no functional changes besides the function split are made. Discussed with: rwatson
# 167834	23-Mar-2007	andre	Tidy up some code to conform better to surroundings and style(9), 0 = NULL and space/tab.
# 167833	23-Mar-2007	andre	Bring SACK option handling in tcp_dooptions() in line with all other options and ajust users accordingly.
# 167785	21-Mar-2007	andre	ANSIfy function declarations and remove register keywords for variables. Consistently apply style to all function declarations.
# 167779	21-Mar-2007	andre	Tidy up IPFIREWALL_FORWARD sections and comments.
# 167778	21-Mar-2007	andre	Update and clarify comments in first section of tcp_input().
# 167777	21-Mar-2007	andre	Tidy up the ACCEPTCONN section of tcp_input(), ajust comments and remove old dead T/TCP code.
# 167775	21-Mar-2007	andre	Tidy up tcp_log_in_vain and blackhole.
# 167774	21-Mar-2007	andre	Make TCP_DROP_SYNFIN a standard part of TCP. Disabled by default it doesn't impede normal operation negatively and is only a few lines of code. It's close relatives blackhole and log_in_vain aren't options either.
# 167772	21-Mar-2007	andre	Remove tcp_minmssoverload DoS detection logic. The problem it tried to protect us from wasn't really there and it only bloats the code. Should the problem surface in the future we can simply resurrect it from cvs history.
# 167721	19-Mar-2007	andre	Match up SYSCTL declaration style.
# 167606	15-Mar-2007	andre	Consolidate insertion of TCP options into a segment from within tcp_output() and syncache_respond() into its own generic function tcp_addoptions(). tcp_addoptions() is alignment agnostic and does optimal packing in all cases. In struct tcpopt rename to_requested_s_scale to just to_wscale. Add a comment with quote from RFC1323: "The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself is never scaled." Reviewed by: silby, mohans, julian Sponsored by: TCP/IP Optimization Fundraise 2005
# 167310	07-Mar-2007	qingli	This patch is provided to fix a couple of deployment issues observed in the field. In one situation, one end of the TCP connection sends a back-to-back RST packet, with delayed ack, the last_ack_sent variable has not been update yet. When tcp_insecure_rst is turned off, the code treats the RST as invalid because last_ack_sent instead of rcv_nxt is compared against th_seq. Apparently there is some kind of firewall that sits in between the two ends and that RST packet is the only RST packet received. With short lived HTTP connections, the symptom is a large accumulation of connections over a short period of time . The +/-(1) factor is to take care of implementations out there that generate RST packets with these types of sequence numbers. This behavior has also been observed in live environments. Reviewed by: silby, Mike Karels MFC after: 1 week
# 167120	28-Feb-2007	mohans	In the SYN_SENT case, Initialize the snd_wnd before the call to tcp_mss(). The TCP hostcache logic in tcp_mss() depends on the snd_wnd being initialized.
# 167036	26-Feb-2007	mohans	Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default. Reviewed by: gnn, silby.
# 166842	20-Feb-2007	rwatson	Rename two identically named log_in_vain variables: tcp_input.c's static log_in_vain to tcp_log_in_vain, and udp_usrreq's global log_in_vain to udp_log_in_vain. MFC after: 1 week
# 166405	01-Feb-2007	andre	Auto sizing TCP socket buffers. Normally the socket buffers are static (either derived from global defaults or set with setsockopt) and do not adapt to real network conditions. Two things happen: a) your socket buffers are too small and you can't reach the full potential of the network between both hosts; b) your socket buffers are too big and you waste a lot of kernel memory for data just sitting around. With automatic TCP send and receive socket buffers we can start with a small buffer and quickly grow it in parallel with the TCP congestion window to match real network conditions. FreeBSD has a default 32K send socket buffer. This supports a maximal transfer rate of only slightly more than 2Mbit/s on a 100ms RTT trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send buffer auto scaling and the default values below it supports 20Mbit/s at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or 1000%. For the receive side it looks slightly better with a default of 64K buffer size. New sysctls are: net.inet.tcp.sendbuf_auto=1 (enabled) net.inet.tcp.sendbuf_inc=8192 (8K, step size) net.inet.tcp.sendbuf_max=262144 (256K, growth limit) net.inet.tcp.recvbuf_auto=1 (enabled) net.inet.tcp.recvbuf_inc=16384 (16K, step size) net.inet.tcp.recvbuf_max=262144 (256K, growth limit) Tested by: many (on HEAD and RELENG_6) Approved by: re MFC after: 1 month
# 165118	12-Dec-2006	bz	MFp4: 92972, 98913 + one more change In ip6_sprintf no longer use and return one of eight static buffers for printing/logging ipv6 addresses. The caller now has to hand in a sufficiently large buffer as first argument.
# 163606	22-Oct-2006	rwatson	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 162642	26-Sep-2006	jmg	fix calculating to_tsecr... This prevents the rtt calculations from going all wonky...
# 162580	23-Sep-2006	bms	Always set the IP version in the TCP input path, to preserve the header field for possible later IPSEC SPD lookup, even when the kernel is built without 'options INET6'. PR: kern/57760 MFC after: 1 week Submitted by: Joachim Schueth
# 162277	13-Sep-2006	andre	Rewrite of TCP syncookies to remove locking requirements and to enhance functionality: - Remove a rwlock aquisition/release per generated syncookie. Locking is now integrated with the bucket row locking of syncache itself and syncookies no longer add any additional lock overhead. - Syncookie secrets are different for and stored per syncache buck row. Secrets expire after 16 seconds and are reseeded on-demand. - The computational overhead for syncookie generation and verification is one MD5 hash computation as before. - Syncache can be turned off and run with syncookies only by setting the sysctl net.inet.tcp.syncookies_only=1. This implementation extends the orginal idea and first implementation of FreeBSD by using not only the initial sequence number field to store information but also the timestamp field if present. This way we can keep track of the entire state we need to know to recreate the session in its original form. Almost all TCP speakers implement RFC1323 timestamps these days. For those that do not we still have to live with the known shortcomings of the ISN only SYN cookies. The use of the timestamp field causes the timestamps to be randomized if syncookies are enabled. The idea of SYN cookies is to encode and include all necessary information about the connection setup state within the SYN-ACK we send back and thus to get along without keeping any local state until the ACK to the SYN-ACK arrives (if ever). Everything we need to know should be available from the information we encoded in the SYN-ACK. A detailed description of the inner working of the syncookies mechanism is included in the comments in tcp_syncache.c. Reviewed by: silby (slightly earlier version) Sponsored by: TCP/IP Optimization Fundraise 2005
# 162111	07-Sep-2006	ru	Back when we had T/TCP support, we used to apply different timeouts for TCP and T/TCP connections in the TIME_WAIT state, and we had two separate timed wait queues for them. Now that is has gone, the timeout is always 2*MSL again, and there is no reason to keep two queues (the first was unused anyway!). Also, reimplement the remaining queue using a TAILQ (it was technically impossible before, with two queues).
# 162084	06-Sep-2006	andre	First step of TSO (TCP segmentation offload) support in our network stack. o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6 o add CSUM_TSO flag to mbuf pkthdr csum_flags field o add tso_segsz field to mbuf pkthdr o enhance ip_output() packet length check to allow for large TSO packets o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities o adjust all callers of tcp_maxmtu[46]() accordingly Discussed on: -current, -net Sponsored by: TCP/IP Optimization Fundraise 2005
# 161226	11-Aug-2006	mohans	Fixes an edge case bug in timewait handling where ticks rolling over causing the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry). Reviewed by: silby
# 160024	29-Jun-2006	bz	Use INPLOOKUP_WILDCARD instead of just 1 more consistently. OKed by: rwatson (some weeks ago)
# 159950	26-Jun-2006	andre	Some cleanups and janitorial work to tcp_syncache: o don't assign remote/local host/port information manually between provided struct in_conninfo and struct syncache, bcopy() it instead o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture the purpose of this field o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons o fix IPSEC error case printf's to report correct function name o in syncache_socket() only transpose enhanced tcp options parameters to struct tcpcb when the inpcb doesn't has TF_NOOPT set o in syncache_respond() reorder stack variables o in syncache_respond() remove bogus KASSERT() No functional changes. Sponsored by: TCP/IP Optimization Fundraise 2005
# 159949	26-Jun-2006	andre	Some cleanups and janitorial work to tcp_dooptions(): o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its usage accordingly o update the comments to the tcp_dooptions() invocation in tcp_input():after_listen to reflect reality o move the logic checking the echoed timestamp out of tcp_dooptions() to the only place that uses it next to the invocation described in the previous item o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others o add comments in to struct tcpopt.to_flags #defines No functional changes. Sponsored by: TCP/IP Optimization Fundraise 2005
# 159772	19-Jun-2006	dwmalone	When we receive an out-of-window SYN for an "ESTABLISHED" connection, ACK the SYN as required by RFC793, rather than ignoring it. NetBSD have had a similar change since 1999. PR: 93236 Submitted by: Grant Edwards <grante@visi.com> MFC after: 1 month
# 159695	17-Jun-2006	andre	Add locking to TCP syncache and drop the global tcpinfo lock as early as possible for the syncache_add() case. The syncache timer no longer aquires the tcpinfo lock and timeout/retransmit runs can happen in parallel with bucket granularity. On a P4 the additional locks cause a slight degression of 0.7% in tcp connections per second. When IP and TCP input are deserialized and can run in parallel this little overhead can be neglected. The syncookie handling still leaves room for improvement and its random salts may be moved to the syncache bucket head structures to remove the second lock operation currently required for it. However this would be a more involved change from the way syncookies work at the moment. Reviewed by: rwatson Tested by: rwatson, ps (earlier version) Sponsored by: TCP/IP Optimization Fundraise 2005
# 157927	21-Apr-2006	ps	Allow for nmbclusters and maxsockets to be increased via sysctl. An eventhandler is used to update all the various zones that depend on these values.
# 157609	09-Apr-2006	rwatson	Modify tcp_timewait() to accept an inpcb reference, not a tcptw reference. For now, we allow the possibility that the in_ppcb pointer in the inpcb may be NULL if a timewait socket has had its tcptw structure recycled. This allows tcp_timewait() to consistently unlock the inpcb. Reported by: Kazuaki Oda <kaakun at highway dot ne dot jp> MFC after: 3 months
# 157534	05-Apr-2006	rwatson	Don't unlock a timewait structure if the pointer is NULL in tcp_timewait(). This corrects a bug (or lack of fixing of a bug) in tcp_input.c:1.295. Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp> MFC after: 3 months
# 157474	04-Apr-2006	rwatson	Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being NULL. We currently do allow this to happen, but may want to remove that possibility in the future. This case can occur when a socket is left open after TCP wraps up, and the timewait state is recycled. This will be cleaned up in the future. Found by: Kazuaki Oda <kaakun at highway dot ne dot jp> MFC after: 3 months
# 157432	03-Apr-2006	rwatson	Change inp_ppcb from caddr_t to void , fix/remove associated related casts. Consistently use intotw() to cast inp_ppcb pointers to struct tcptw pointers. Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb * pointers. Don't assign tp to the results to intotcpcb() during variable declation at the top of functions, as that is before the asserts relating to locking have been performed. Do this later in the function after appropriate assertions have run to allow that operation to be conisdered safe. MFC after: 3 months
# 157376	01-Apr-2006	rwatson	Update TCP for infrastructural changes to the socket/pcb refcount model, pru_abort(), pru_detach(), and in_pcbdetach(): - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code. - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, the receive code no longer requires the pcbinfo lock, and the send code only requires it if building a new connection on an otherwise unconnected socket triggered via sendto() with an address. This should significnatly reduce tcbinfo lock contention in the receive and send cases. - In order to support the invariant that so_pcb != NULL, it is now necessary for the TCP code to not discard the tcpcb any time a connection is dropped, but instead leave the tcpcb until the socket is shutdown. This case is handled by setting INP_DROPPED, to substitute for using a NULL so_pcb to indicate that the connection has been dropped. This requires the inpcb lock, but not the pcbinfo lock. - Unlike all other protocols in the tree, TCP may need to retain access to the socket after the file descriptor has been closed. Set SS_PROTOREF in tcp_detach() in order to prevent the socket from being freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether or not it needs to free the socket when the connection finally does close. The typical case where this occurs is if close() is called on a TCP socket before all sent data in the send socket buffer has been transmitted or acknowledged. If INP_SOCKREF is found when the connection is dropped, we release the inpcb, tcpcb, and socket instead of flagging INP_DROPPED. - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this. - Annotate the existence of a long-standing race in the TCP timer code, in which timers are stopped but not drained when the socket is freed, as waiting for drain may lead to deadlocks, or have to occur in a context where waiting is not permitted. This race has been handled by testing to see if the tcpcb pointer in the inpcb is NULL (and vice versa), which is not normally permitted, but may be true of a inpcb and tcpcb have been freed. Add a counter to test how often this race has actually occurred, and a large comment for each instance where we compare potentially freed memory with NULL. This will have to be fixed in the near future, but requires is to further address how to handle the timer shutdown shutdown issue. - Several TCP calls no longer potentially free the passed inpcb/tcpcb, so no longer need to return a pointer to indicate whether the argument passed in is still valid. - Un-macroize debugging and locking setup for various protocol switch methods for TCP, as it lead to more obscurity, and as locking becomes more customized to the methods, offers less benefit. - Assert copyright on tcp_usrreq.c due to significant modifications that have been made as part of this work. These changes significantly modify the memory management and connection logic of our TCP implementation, and are (as such) High Risk Changes, and likely to contain serious bugs. Please report problems to the current@ mailing list ASAP, ideally with simple test cases, and optionally, packet traces. MFC after: 3 months
# 157136	26-Mar-2006	rwatson	Explicitly assert socket pointer is non-NULL in tcp_input() so as to provide better debugging information. Prefer explicit comparison to NULL for tcpcb pointers rather than treating them as booleans. MFC after: 1 month
# 156125	28-Feb-2006	andre	Rework TCP window scaling (RFC1323) to properly scale the send window right from the beginning and partly clean up the differences in handling between SYN_SENT and SYN_RCVD (syncache). Further changes to this code to come. This is a first incremental step to a general overhaul and streamlining of the TCP code. PR: kern/15095 PR: kern/92690 (partly) Reviewed by: qingli (and tested with ANVL) Sponsored by: TCP/IP Optimization Fundraise 2005
# 155961	23-Feb-2006	qingli	This patch fixes the problem where the current TCP code can not handle simultaneous open. Both the bug and the patch were verified using the ANVL test suite. PR: kern/74935 Submitted by: qingli (before I became committer) Reviewed by: andre MFC after: 5 days
# 155819	18-Feb-2006	andre	Remove unneeded includes and provide more accurate description to others. Submitted by: garys PR: kern/86437
# 155767	16-Feb-2006	andre	Have TCP Inflight disable itself if the RTT is below a certain threshold. Inflight doesn't make sense on a LAN as it has trouble figuring out the maximal bandwidth because of the coarse tick granularity. The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold in milliseconds below which inflight will disengage. It defaults to 10ms. Tested by: Joao Barros <joao.barros-at-gmail.com>, Rich Murphey <rich-at-whiteoaklabs.com> Sponsored by: TCP/IP Optimization Fundraise 2005
# 154528	18-Jan-2006	andre	Do not derefence the ip header pointer in the IPv6 case. This fixes a bug in the previous commit. Found by: Coverity Prevent(tm) Coverity ID: CID253 Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 days
# 154366	14-Jan-2006	gnn	Check the correct TTL in both the IPv6 and IPv4 cases. Submitted by: glebius Reviewed by: gnn, bz Found with: Coverity Prevent(tm)
# 152592	18-Nov-2005	andre	Consolidate all IP Options handling functions into ip_options.[ch] and include ip_options.h into all files making use of IP Options functions. From ip_input.c rev 1.306: ip_dooptions(struct mbuf m, int pass) save_rte(m, option, dst) ip_srcroute(m0) ip_stripoptions(m, mopt) From ip_output.c rev 1.249: ip_insertoptions(m, opt, phlen) ip_optcopy(ip, jp) ip_pcbopts(struct inpcb inp, int optname, struct mbuf *m) No functional changes in this commit. Discussed with: rwatson Sponsored by: TCP/IP Optimization Fundraise 2005
# 151464	19-Oct-2005	rwatson	Convert if (tp->t_state == TCPS_LISTEN) panic() into a KASSERT. MFC after: 2 weeks
# 149404	24-Aug-2005	ps	Remove a KASSERT in the sack path that fails because of a interaction between sack and a bug in the "bad retransmit recovery" logic. This is a workaround, the underlying bug will be fixed later. Submitted by: Mohan Srinivasan, Noritoshi Demizu
# 149371	22-Aug-2005	andre	Add socketoption IP_MINTTL. May be used to set the minimum acceptable TTL a packet must have when received on a socket. All packets with a lower TTL are silently dropped. Works on already connected/connecting and listening sockets for RAW/UDP/TCP. This option is only really useful when set to 255 preventing packets from outside the directly connected networks reaching local listeners on sockets. Allows userland implementation of 'The Generalized TTL Security Mechanism (GTSM)' according to RFC3682. Examples of such use include the Cisco IOS BGP implementation command "neighbor ttl-security". MFC after: 2 weeks Sponsored by: TCP/IP Optimization Fundraise 2005
# 147781	05-Jul-2005	ps	Fix for a bug in newreno partial ack handling where if a large amount of data is partial acked, snd_cwnd underflows, causing a burst. Found, Submitted by: Noritoshi Demizu Approved by: re
# 147735	01-Jul-2005	ps	Fix for a bug in the change that defers sack option processing until after PAWS checks. The symptom of this is an inconsistency in the cached sack state, caused by the fact that the sack scoreboard was not being updated for an ACK handled in the header prediction path. Found by: Andrey Chernov. Submitted by: Noritoshi Demizu, Raja Mukerji. Approved by: re
# 147734	01-Jul-2005	ps	Fix for a SACK crash caused by a bug in tcp_reass(). tcp_reass() does not clear tlen and frees the mbuf (leaving th pointing at freed memory), if the data segment is a complete duplicate. This change works around that bug. A fix for the tcp_reass() bug will appear later (that bug is benign for now, as neither th nor tlen is referenced in tcp_input() after the call to tcp_reass()). Found by: Pawel Jakub Dawidek. Submitted by: Raja Mukerji, Noritoshi Demizu. Approved by: re
# 147666	29-Jun-2005	simon	Fix ipfw packet matching errors with address tables. The ipfw tables lookup code caches the result of the last query. The kernel may process multiple packets concurrently, performing several concurrent table lookups. Due to an insufficient locking, a cached result can become corrupted that could cause some addresses to be incorrectly matched against a lookup table. Submitted by: ru Reviewed by: csjp, mlaier Security: CAN-2005-2019 Security: FreeBSD-SA-05:13.ipfw Correct bzip2 permission race condition vulnerability. Obtained from: Steve Grubb via RedHat Security: CAN-2005-0953 Security: FreeBSD-SA-05:14.bzip2 Approved by: obrien Correct TCP connection stall denial of service vulnerability. A TCP packets with the SYN flag set is accepted for established connections, allowing an attacker to overwrite certain TCP options. Submitted by: Noritoshi Demizu Reviewed by: andre, Mohan Srinivasan Security: CAN-2005-2068 Security: FreeBSD-SA-05:15.tcp Approved by: re (security blanket), cperciva
# 147637	27-Jun-2005	ps	- Postpone SACK option processing until after PAWS checks. SACK option processing is now done in the ACK processing case. - Merge tcp_sack_option() and tcp_del_sackholes() into a new function called tcp_sack_doack(). - Test (SEG.ACK < SND.MAX) before processing the ACK. Submitted by: Noritoshi Demizu Reveiewed by: Mohan Srinivasan, Raja Mukerji Approved by: re
# 147605	25-Jun-2005	ups	Fix a timer ticks wrap around bug for minmssoverload processing. Approved by: re (scottl,dwhite) MFC after: 4 weeks
# 146863	01-Jun-2005	rwatson	Assert that tcbinfo is locked in tcp_input() before calling into tcp_drop(). MFC after: 7 days
# 146862	01-Jun-2005	rwatson	Assert the tcbinfo lock whenever tcp_close() is to be called by tcp_input(). MFC after: 7 days
# 146630	25-May-2005	ps	This is conform with the terminology in M.Mathis and J.Mahdavi, "Forward Acknowledgement: Refining TCP Congestion Control" SIGCOMM'96, August 1996. Submitted by: Noritoshi Demizu, Raja Mukerji
# 146123	11-May-2005	ps	When looking for the next hole to retransmit from the scoreboard, or to compute the total retransmitted bytes in this sack recovery episode, the scoreboard is traversed. While in sack recovery, this traversal occurs on every call to tcp_output(), every dupack and every partial ack. The scoreboard could potentially get quite large, making this traversal expensive. This change optimizes this by storing hints (for the next hole to retransmit and the total retransmitted bytes in this sack recovery episode) reducing the complexity to find these values from O(n) to constant time. The debug code that sanity checks the hints against the computed value will be removed eventually. Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
# 145087	14-Apr-2005	ps	Fix for a TCP SACK bug where more than (win/2) bytes could have been in flight in SACK recovery. Found by: Noritoshi Demizu Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com> Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp> Raja Mukerji <raja at moselle dot com>
# 144858	10-Apr-2005	ps	- Tighten up the Timestamp checks to prevent a spoofed segment from setting ts_recent to an arbitrary value, stopping further communication between the two hosts. - If the Echoed Timestamp is greater than the current time, fall back to the non RFC 1323 RTT calculation. Submitted by: Raja Mukerji (raja at moselle dot com) Reviewed by: Noritoshi Demizu, Mohan Srinivasan
# 144857	10-Apr-2005	ps	- If the reassembly queue limit was reached or if we couldn't allocate a reassembly queue state structure, don't update (receiver) sack report. - Similarly, if tcp_drain() is called, freeing up all items on the reassembly queue, clean the sack report. Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp> Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com), Raja Mukerji (raja at moselle dot com).
# 142031	17-Feb-2005	ps	Remove 2 (SACK) fields from the tcpcb. These are only used by a function that is called from tcp_input(), so they oughta be passed on the stack instead of stuck in the tcpcb. Submitted by: Mohan Srinivasan
# 141961	16-Feb-2005	ps	Fix for a SACK (receiver) bug where incorrect SACK blocks are reported to the sender - in the case where the sender sends data outside the window (as WinXP does :(). Reported by: Sam Jensen <sam at wand dot net dot nz> Submitted by: Mohan Srinivasan
# 141928	14-Feb-2005	ps	- Retransmit just one segment on initiation of SACK recovery. Remove the SACK "initburst" sysctl. - Fix bugs in SACK dupack and partialack handling that can cause large bursts while in SACK recovery. Submitted by: Mohan Srinivasan
# 139823	07-Jan-2005	imp	/* -> /*- for license, minor formatting changes
# 139606	03-Jan-2005	silby	Add a sysctl (net.inet.tcp.insecure_rst) which allows one to specify that the RFC 793 specification for accepting RST packets should be following. When followed, this makes one vulnerable to the attacks described in "slipping in the window", but it may be necessary in some odd circumstances.
# 139298	25-Dec-2004	rwatson	In the dropafterack case of tcp_input(), it's OK to release the TCP pcbinfo lock before calling tcp_output(), as holding just the inpcb lock is sufficient to prevent garbage collection.
# 139297	25-Dec-2004	rwatson	Revert parts of tcp_input.c:1.255 associated with the header predicted cases for tcp_input(): While it is true that the pcbinfo lock provides a pseudo-reference to inpcbs, both the inpcb and pcbinfo locks are required to free an un-referenced inpcb. As such, we can release the pcbinfo lock as long as the inpcb remains locked with the confidence that it will not be garbage-collected. This leads to a less conservative locking strategy that should reduce contention on the TCP pcbinfo lock. Discussed with: sam
# 138148	28-Nov-2004	rwatson	Assert the inpcb lock in tcp_xmit_timer() as it performs read-modify- write of various time/rtt-related fields in the tcpcb.
# 138147	28-Nov-2004	rwatson	Expand coverage of the receive socket buffer lock when handling urgent pointer updates: test available space while holding the socket buffer mutex, and continue to hold until until the pointer update has been performed. MFC after: 2 weeks
# 138098	25-Nov-2004	silby	Fix a problem where our TCP stack would ignore RST packets if the receive window was 0 bytes in size. This may have been the cause of unsolved "connection not closing" reports over the years. Thanks to Michiel Boland for providing the fix and providing a concise test program for the problem. Submitted by: Michiel Boland MFC after: 2 weeks
# 138040	23-Nov-2004	rwatson	In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the contents of the tcpcb are read and modified in volume. In tcp_input(), replace th comparison with 0 with a comparison with NULL. At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in tcp_input(), assert 'headlocked'. Try to improve consistency between various assertions regarding headlocked to be more informative. MFC after: 2 weeks
# 138025	23-Nov-2004	rwatson	tcp_timewait() performs multiple non-atomic reads on the tcptw structure, so assert the inpcb lock associated with the tcptw. Also assert the tcbinfo lock, as tcp_timewait() may call tcp_twclose() or tcp_2msl_rest(), which require it. Since tcp_timewait() is already called with that lock from tcp_input(), this doesn't change current locking, merely documents reasons for it. In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest() is called, which requires that lock. In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop() is called, which requires that lock. Document the locking strategy for the time wait queues in tcp_timer.c, which consists of protecting the time wait queues in the same manner as the tcbinfo structure (using the tcbinfo lock). In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait queues are modified. In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait queues may be modified. In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait queues may be modified. MFC after: 2 weeks
# 137988	22-Nov-2004	rwatson	Remove "Unlocked read" annotations associated with previously unlocked use of socket buffer fields in the TCP input code. These references are now protected by use of the receive socket buffer lock. MFC after: 1 week
# 137349	07-Nov-2004	rwatson	Do some re-sorting of TCP pcbinfo locking and assertions: make sure to retain the pcbinfo lock until we're done using a pcb in the in-bound path, as the pcbinfo lock acts as a pseuo-reference to prevent the pcb from potentially being recycled. Clean up assertions and make sure to assert that the pcbinfo is locked at the head of code subsections where it is needed. Free the mbuf at the end of tcp_input after releasing any held locks to reduce the time the locks are held. MFC after: 3 weeks
# 137139	02-Nov-2004	andre	Remove RFC1644 T/TCP support from the TCP side of the network stack. A complete rationale and discussion is given in this message and the resulting discussion: http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706 Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplemention of the previous T/TCP functionality. Discussed on: -arch
# 136151	05-Oct-2004	ps	- Estimate the amount of data in flight in sack recovery and use it to control the packets injected while in sack recovery (for both retransmissions and new data). - Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c. - Add a new sysctl (net.inet.tcp.sack.initburst) that controls the number of sack retransmissions done upon initiation of sack recovery. Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>
# 133920	17-Aug-2004	andre	Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland and preserves the ipfw ABI. The ipfw core packet inspection and filtering functions have not been changed, only how ipfw is invoked is different. However there are many changes how ipfw is and its add-on's are handled: In general ipfw is now called through the PFIL_HOOKS and most associated magic, that was in ip_input() or ip_output() previously, is now done in ipfw_check_[in\|out]() in the ipfw PFIL handler. IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to be diverted is checked if it is fragmented, if yes, ip_reass() gets in for reassembly. If not, or all fragments arrived and the packet is complete, divert_packet is called directly. For 'tee' no reassembly attempt is made and a copy of the packet is sent to the divert socket unmodified. The original packet continues its way through ip_input/output(). ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet with the new destination sockaddr_in. A check if the new destination is a local IP address is made and the m_flags are set appropriately. ip_input() and ip_output() have some more work to do here. For ip_input() the m_flags are checked and a packet for us is directly sent to the 'ours' section for further processing. Destination changes on the input path are only tagged and the 'srcrt' flag to ip_forward() is set to disable destination checks and ICMP replies at this stage. The tag is going to be handled on output. ip_output() again checks for m_flags and the 'ours' tag. If found, the packet will be dropped back to the IP netisr where it is going to be picked up by ip_input() again and the directly sent to the 'ours' section. When only the destination changes, the route's 'dst' is overwritten with the new destination from the forward m_tag. Then it jumps back at the route lookup again and skips the firewall check because it has been marked with M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with 'option IPFIREWALL_FORWARD' to enable it. DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will then inject it back into ip_input/ip_output() after it has served its time. Dummynet packets are tagged and will continue from the next rule when they hit the ipfw PFIL handlers again after re-injection. BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS. More detailed changes to the code: conf/files Add netinet/ip_fw_pfil.c. conf/options Add IPFIREWALL_FORWARD option. modules/ipfw/Makefile Add ip_fw_pfil.c. net/bridge.c Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw is still directly invoked to handle layer2 headers and packets would get a double ipfw when run through PFIL_HOOKS as well. netinet/ip_divert.c Removed divert_clone() function. It is no longer used. netinet/ip_dummynet.[ch] Neither the route 'ro' nor the destination 'dst' need to be stored while in dummynet transit. Structure members and associated macros are removed. netinet/ip_fastfwd.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_fw.h Removed 'ro' and 'dst' from struct ip_fw_args. netinet/ip_fw2.c (Re)moved some global variables and the module handling. netinet/ip_fw_pfil.c New file containing the ipfw PFIL handlers and module initialization. netinet/ip_input.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. ip_forward() does not longer require the 'next_hop' struct sockaddr_in argument. Disable early checks if 'srcrt' is set. netinet/ip_output.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_var.h Add ip_reass() as general function. (Used from ipfw PFIL handlers for IPDIVERT.) netinet/raw_ip.c Directly check if ipfw and dummynet control pointers are active. netinet/tcp_input.c Rework the 'ipfw forward' to local code to work with the new way of forward tags. netinet/tcp_sack.c Remove include 'opt_ipfw.h' which is not needed here. sys/mbuf.h Remove m_claim_next() macro which was exclusively for ipfw 'forward' and is no longer needed. Approved by: re (scottl)
# 133874	16-Aug-2004	rwatson	White space cleanup for netinet before branch: - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>
# 132044	12-Jul-2004	rwatson	After each label in tcp_input(), assert the inpcbinfo and inpcb lock state that we expect.
# 131427	01-Jul-2004	jayanth	On receiving 3 duplicate acknowledgements, SACK recovery was not being entered correctly. Fix this problem by separating out the SACK and the newreno cases. Also, check if we are in FASTRECOVERY for the sack case and if so, turn off dupacks. Fix an issue where the congestion window was not being incremented by ssthresh. Thanks to Mohan Srinivasan for finding this problem.
# 131151	26-Jun-2004	rwatson	Reduce the number of unnecessary unlock-relocks on socket buffer mutexes associated with performing a wakeup on the socket buffer: - When performing an sbappend*() followed by a so[rw]wakeup(), explicitly acquire the socket buffer lock and use the _locked() variants of both calls. Note that the _locked() sowakeup() versions unlock the mutex on return. This is done in uipc_send(), divert_packet(), mroute socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append(). - When the socket buffer lock is dropped before a sowakeup(), remove the explicit unlock and use the _locked() sowakeup() variant. This is done in soisdisconnecting(), soisdisconnected() when setting the can't send/ receive flags and dropping data, and in uipc_rcvd() which adjusting back-pressure on the sockets. For UNIX domain sockets running mpsafe with a contention-intensive SMP mysql benchmark, this results in a 1.6% query rate improvement due to reduce mutex costs.
# 131079	25-Jun-2004	ps	White space & spelling fixes Submitted by: Xin LI <delphij@frontfree.net>
# 131018	24-Jun-2004	rwatson	Broaden scope of the socket buffer lock when processing an ACK so that the read and write of sb_cc are atomic. Call sbdrop_locked() instead of sbdrop() since we already hold the socket buffer lock.
# 131017	24-Jun-2004	rwatson	Protect so_oobmark with with SOCKBUF_LOCK(&so->so_rcv), and broaden locking in tcp_input() for TCP packets with urgent data pointers to hold the socket buffer lock across testing and updating oobmark from just protecting sb_state. Update socket locking annotations
# 131006	24-Jun-2004	rwatson	Introduce sbreserve_locked(), which asserts the socket buffer lock on the socket buffer having its limits adjusted. sbreserve() now acquires the lock before calling sbreserve_locked(). In soreserve(), acquire socket buffer locks across read-modify-writes of socket buffer fields, and calls into sbreserve/sbrelease; make sure to acquire in keeping with the socket buffer lock order. In tcp_mss(), acquire the socket buffer lock in the calling context so that we have atomic read-modify -write on buffer sizes.
# 130989	23-Jun-2004	ps	Add support for TCP Selective Acknowledgements. The work for this originated on RELENG_4 and was ported to -CURRENT. The scoreboarding code was obtained from OpenBSD, and many of the remaining changes were inspired by OpenBSD, but not taken directly from there. You can enable/disable sack using net.inet.tcp.do_sack. You can also limit the number of sack holes that all senders can have in the scoreboard with net.inet.tcp.sackhole_limit. Reviewed by: gnn Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
# 130811	20-Jun-2004	rwatson	Assert the inpcb lock before letting MAC check whether we can deliver to the inpcb in tcp_input().
# 130584	16-Jun-2004	bms	Fix build for IPSEC && !INET6 PR: kern/66125 Submitted by: Cyrille Lefevre
# 130513	15-Jun-2004	rwatson	Grab the socket buffer send or receive mutex when performing a read-modify-write on the sb_state field. This commit catches only the "easy" ones where it doesn't interact with as yet unmerged locking.
# 130480	14-Jun-2004	rwatson	The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coallesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.
# 130398	13-Jun-2004	rwatson	Socket MAC labels so_label and so_peerlabel are now protected by SOCK_LOCK(so): - Hold socket lock over calls to MAC entry points reading or manipulating socket labels. - Assert socket lock in MAC entry point implementations. - When externalizing the socket label, first make a thread-local copy while holding the socket lock, then release the socket lock to externalize to userspace.
# 128829	02-May-2004	darrenr	Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier.
# 128828	02-May-2004	darrenr	oops, I forgot this file in a prior commit (change was still sitting here, uncommitted): Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg (the type of tag to claim) and push it out of ip_var.h into mbuf.h alongside all of the other macros that work ok mbuf's and tag's.
# 128653	26-Apr-2004	silby	Tighten up reset handling in order to make reset attacks as difficult as possible while maintaining compatibility with the widest range of TCP stacks. The algorithm is as follows: --- For connections in the ESTABLISHED state, only resets with sequence numbers exactly matching last_ack_sent will cause a reset, all other segments will be silently dropped. For connections in all other states, a reset anywhere in the window will cause the connection to be reset. All other segments will be silently dropped. --- The necessity of accepting all in-window resets was discovered by jayanth and jlemon, both of whom have seen TCP stacks that will respond to FIN-ACK packets with resets not meeting the strict last_ack_sent check. Idea by: Darren Reed Reviewed by: truckman, jlemon, others(?)
# 128592	23-Apr-2004	andre	Correct an edge case in tcp_mss() where the cached path MTU from tcp_hostcache would have overridden a (now) lower MTU of an interface or route that changed since first PMTU discovery. The bug would have caused TCP to redo the PMTU discovery when not strictly necessary. Make a comment about already pre-initialized default values more clear. Reviewed by: sam
# 128019	07-Apr-2004	imp	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
# 126456	01-Mar-2004	ume	fix -O0 compilation without INET6. Pointed out by: ru
# 126351	28-Feb-2004	rwatson	Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These were needed by the MAC Framework until inpcbs gained labels. Submitted by: sam
# 126239	25-Feb-2004	mlaier	Re-remove MT_TAGs. The problems with dummynet have been fixed now. Tested by: -current, bms(mentor), me Approved by: bms(mentor), sam
# 126220	25-Feb-2004	hsu	Relax a KASSERT condition to allow for a valid corner case where the FIN on the last segment consumes an extra sequence number. Spurious panic reported by Mike Silbersack <silby@silby.com>.
# 126193	24-Feb-2004	andre	Convert the tcp segment reassembly queue to UMA and limit the maximum amount of segments it will hold. The following tuneables and sysctls control the behaviour of the tcp segment reassembly queue: net.inet.tcp.reass.maxsegments (loader tuneable) specifies the maximum number of segments all tcp reassemly queues can hold (defaults to 1/16 of nmbclusters). net.inet.tcp.reass.maxqlen specifies the maximum number of segments any individual tcp session queue can hold (defaults to 48). net.inet.tcp.reass.cursegments (readonly) counts the number of segments currently in all reassembly queues. net.inet.tcp.reass.overflows (readonly) counts how often either the global or local queue limit has been reached. Tested by: bms, silby Reviewed by: bms, silby
# 125952	18-Feb-2004	mlaier	Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is not working properly with the patch in place. Approved by: bms(mentor)
# 125941	17-Feb-2004	ume	IPSEC and FAST_IPSEC have the same internal API now; so merge these (IPSEC has an extra ipsecstat) Submitted by: "Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>
# 125784	13-Feb-2004	mlaier	This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing them mostly with packet tags (one case is handled by using an mbuf flag since the linkage between "caller" and "callee" is direct and there's no need to incur the overhead of a packet tag). This is (mostly) work from: sam Silence from: -arch Approved by: bms(mentor), sam, rwatson
# 125783	13-Feb-2004	bms	Brucification. Submitted by: bde
# 125740	12-Feb-2004	bms	Remove an unnecessary initialization that crept in from the code which verifies TCP-MD5 digests. Noticed by: njl
# 125680	11-Feb-2004	bms	Initial import of RFC 2385 (TCP-MD5) digest support. This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net
# 125396	03-Feb-2004	ume	pass pcb rather than so. it is expected that per socket policy works again.
# 124761	20-Jan-2004	hsu	Merge from DragonFlyBSD rev 1.10: date: 2003/09/02 10:04:47; author: hsu; state: Exp; lines: +5 -6 Account for when Limited Transmit is not congestion window limited. Obtained from: DragonFlyBSD
# 124258	08-Jan-2004	andre	Limiters and sanity checks for TCP MSS (maximum segement size) resource exhaustion attacks. For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU. The resource exhaustion works in two ways: o during tcp connection setup the advertized local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one. For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limites of network components along the path. This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP). We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets. o the local host is reveiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may chose to do what is described in the first attack and send the data in packets with an TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process. For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches). This type of attack is particularly effective for servers where the attacker can upload large amounts of data. Normally this is the case with WWW server where large POSTs can be made. We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is resetted and dropped. MITRE CVE: CAN-2004-0002 Reviewed by: sam (mentor) MFC after: 1 day
# 124199	06-Jan-2004	andre	Enable the following TCP options by default to give it more exposure: rfc3042 Limited retransmit rfc3390 Increasing TCP's initial congestion Window inflight TCP inflight bandwidth limiting All my production server have it enabled and there have been no issues. I am confident about having them on by default and it gives us better overall TCP performance. Reviewed by: sam (mentor)
# 122987	25-Nov-2003	andre	Restructure a too broad ifdef which was disabling the setting of the tcp flightsize sysctl value for local networks in the !INET6 case. Approved by: re (scottl)
# 122922	20-Nov-2003	andre	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
# 122875	18-Nov-2003	rwatson	Introduce a MAC label reference in 'struct inpcb', which caches the MAC label referenced from 'struct socket' in the IPv4 and IPv6-based protocols. This permits MAC labels to be checked during network delivery operations without dereferencing inp->inp_socket to get to so->so_label, which will eventually avoid our having to grab the socket lock during delivery at the network layer. This change introduces 'struct inpcb' as a labeled object to the MAC Framework, along with the normal circus of entry points: initialization, creation from socket, destruction, as well as a delivery access control check. For most policies, the inpcb label will simply be a cache of the socket label, so a new protocol switch method is introduced, pr_sosetlabel() to notify protocols that the socket layer label has been updated so that the cache can be updated while holding appropriate locks. Most protocols implement this using pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use the the worker function in_pcbsosetlabel(), which calls into the MAC Framework to perform a cache update. Biba, LOMAC, and MLS implement these entry points, as do the stub policy, and test policy. Reviewed by: sam, bms Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 122576	12-Nov-2003	andre	dropwithreset is not needed in this case as tcp_drop() is already notifying the other side. Before we were sending two RST packets.
# 122327	08-Nov-2003	sam	o correct locking problem: the inpcb must be held across tcp_respond o add assertions in tcp_respond to validate inpcb locking assumptions o use local variable instead of chasing pointers in tcp_respond Supported by: FreeBSD Foundation
# 121628	28-Oct-2003	sam	speedup stream socket recv handling by tracking the tail of the mbuf chain instead of walking the list for each append Submitted by: ps/jayanth Obtained from: netbsd (jason thorpe)
# 121285	20-Oct-2003	ume	enclose IPv6 part with ifdef INET6. Obtained from: KAME
# 121283	20-Oct-2003	ume	correct linkmtu handling. Obtained from: KAME
# 121161	17-Oct-2003	ume	- add dom_if{attach,detach} framework. - transition to use ifp->if_afdata. Obtained from: KAME
# 118861	13-Aug-2003	harti	A number of patches in the last years have created new return paths in tcp_input that leave the function before hitting the tcp_trace function call for the TCPDEBUG option. This has made TCPDEBUG mostly useless (and tools like ports/benchmarks/dbs not working). Add tcp_trace calls to the return paths that could be identified in this maze. This is a NOP unless you compile with TCPDEBUG.
# 117650	15-Jul-2003	hsu	Unify the "send high" and "recover" variables as specified in the lastest rev of the spec. Use an explicit flag for Fast Recovery. [1] Fix bug with exiting Fast Recovery on a retransmit timeout diagnosed by Lu Guohan. [2] Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com> Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2] Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>, Sally Floyd <floyd@acm.org> [1]
# 115503	31-May-2003	phk	Add /* FALLTHROUGH */ Found by: FlexeLint
# 114794	07-May-2003	rwatson	Correct a bug introduced with reduced TCP state handling; make sure that the MAC label on TCP responses during TIMEWAIT is properly set from either the socket (if available), or the mbuf that it's responding to. Unfortunately, this is made somewhat difficult by the TCP code, as tcp_twstart() calls tcp_twrespond() after discarding the socket but without a reference to the mbuf that causes the "response". Passing both the socket and the mbuf works arounds this--eventually it might be good to make sure the mbuf always gets passed in in "response" scenarios but working through this provided to complicate things too much. Approved by: re (scottl) Reviewed by: hsu Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 113799	21-Apr-2003	obrien	Explicitly declare 'int' parameters.
# 112957	01-Apr-2003	hsu	Observe conservation of packets when entering Fast Recovery while doing Limited Transmit. Only artificially inflate the congestion window by 1 segment instead of the usual 3 to take into account the 2 already sent by Limited Transmit. Approved in principle by: Mark Allman <mallman@grc.nasa.gov>, Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>
# 112191	13-Mar-2003	hsu	Greatly simplify the unlocking logic by holding the TCP protocol lock until after FIN_WAIT_2 processing. Helped with debugging: Doug Barton
# 112171	13-Mar-2003	hsu	Add support for RFC 3390, which allows for a variable-sized initial congestion window.
# 112162	12-Mar-2003	hsu	Implement the Limited Transmit algorithm (RFC 3042).
# 112009	08-Mar-2003	jlemon	Remove a panic(); if the zone allocator can't provide more timewait structures, reuse the oldest one. Also move the expiry timer from a per-structure callout to the tcp slow timer. Sponsored by: DARPA, NAI Labs
# 111560	26-Feb-2003	jlemon	In timewait state, if the incoming segment is a pure in-sequence ack that matches snd_max, then do not respond with an ack, just drop the segment. This fixes a problem where a simultaneous close results in an ack loop between two time-wait states. Test case supplied by: Tim Robbins <tjr@FreeBSD.ORG> Sponsored by: DARPA, NAI Labs
# 111549	26-Feb-2003	jlemon	The TCP protocol lock may still be held if the reassembly queue dropped FIN. Detect this case and drop the lock accordingly. Sponsored by: DARPA, NAI Labs
# 111389	24-Feb-2003	hsu	tcp_twstart() need to be called with the TCP protocol lock held to avoid a race condition with the TCP timer routines.
# 111386	24-Feb-2003	hsu	Pass the right function to callout_reset() for a compressed TIME-WAIT control block.
# 111319	23-Feb-2003	jlemon	Yesterday just wasn't my day. Remove testing delta that crept into the diff. Pointy hat provided by: sam
# 111266	22-Feb-2003	jlemon	Check to see if the TF_DELACK flag is set before returning from tcp_input(). This unbreaks delack handling, while still preserving correct T/TCP behavior Tested by: maxim Sponsored by: DARPA, NAI Labs
# 111145	19-Feb-2003	jlemon	Add a TCP TIMEWAIT state which uses less space than a fullblown TCP control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs
# 111140	19-Feb-2003	jlemon	Correct comments.
# 111139	19-Feb-2003	jlemon	Clean up delayed acks and T/TCP interactions: - delay acks for T/TCP regardless of delack setting - fix bug where a single pass through tcp_input might not delay acks - use callout_active() instead of callout_pending() Sponsored by: DARPA, NAI Labs
# 110830	13-Feb-2003	hsu	The protocol lock is always held in the dropafterack case, so we don't need to check for it at runtime.
# 110251	02-Feb-2003	cjc	Add the TCP flags to the log message whenever log_in_vain is 1, not just when set to 2. PR: kern/43348 MFC after: 5 days
# 109175	13-Jan-2003	hsu	Fix NewReno. Reviewed by: Tom Henderson <thomas.r.henderson@boeing.com>
# 108464	30-Dec-2002	dillon	Remove the PAWS ack-on-ack debugging printf(). Note that the original RFC 1323 (PAWS) says in 4.2.1 that the out of order / reverse-time-indexed packet should be acknowledged as specified in RFC-793 page 69 then dropped. The original PAWS code in FreeBSD (1994) simply acknowledged the segment unconditionally, which is incorrect, and was fixed in 1.183 (2002). At the moment we do not do checks for SYN or FIN in addition to (tlen != 0), which may or may not be correct, but the worst that ought to happen should be a retry by the sender.
# 108123	20-Dec-2002	hsu	Unravel a nested conditional. Remove an unneeded local variable.
# 107961	17-Dec-2002	dillon	Fix syntax in last commit.
# 107854	14-Dec-2002	dillon	Bruce forwarded this tidbit from an analysis Van Jacobson did on an apparent ack-on-ack problem with FreeBSD. Prof. Jacobson noticed a case in our TCP stack which would acknowledge a received ack-only packet, which is not legal in TCP. Submitted by: Van Jacobson <van@packetdesign.com>, bmah@packetdesign.com (Bruce A. Mah) MFC after: 7 days
# 106736	10-Nov-2002	sam	a better solution to building FAST_IPSEC w/o INET6 Submitted by: Jeffrey Hsu <hsu@FreeBSD.org>
# 106679	08-Nov-2002	sam	fixup FAST_IPSEC build w/o INET6
# 106271	31-Oct-2002	jeff	- Consistently update snd_wl1, snd_wl2, and rcv_up in the header prediction code. Previously, 2GB worth of header predicted data could leave these variables too far out of sequence which would cause problems after receiving a packet that did not match the header prediction. Submitted by: Bill Baumann <bbaumann@isilon.com> Sponsored by: Isilon Systems, Inc. Reviewed by: hsu, pete@isilon.com, neal@isilon.com, aaronp@isilon.com
# 106198	30-Oct-2002	hsu	Don't need to check if SO_OOBINLINE is defined. Don't need to protect isipv6 conditional with INET6. Fix leading indentation in 2 lines.
# 105199	16-Oct-2002	sam	Tie new "Fast IPsec" code into the build. This involves the usual configuration stuff as well as conditional code in the IPv4 and IPv6 areas. Everything is conditional on FAST_IPSEC which is mutually exclusive with IPSEC (KAME IPsec implmentation). As noted previously, don't use FAST_IPSEC with INET6 at the moment. Reviewed by: KAME, rwatson Approved by: silence Supported by: Vernier Networks
# 105194	16-Oct-2002	sam	Replace aux mbufs with packet tags: o instead of a list of mbufs use a list of m_tag structures a la openbsd o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit ABI/module number cookie o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and use this in defining openbsd-compatible m_tag_find and m_tag_get routines o rewrite KAME use of aux mbufs in terms of packet tags o eliminate the most heavily used aux mbufs by adding an additional struct inpcb parameter to ip_output and ip6_output to allow the IPsec code to locate the security policy to apply to outbound packets o bump __FreeBSD_version so code can be conditionalized o fixup ipfilter's call to ip_output based on __FreeBSD_version Reviewed by: julian, luigi (silent), -arch, -net, darren Approved by: julian, silence from everyone else Obtained from: openbsd (mostly) MFC after: 1 month
# 104226	30-Sep-2002	dillon	Guido found another bug. There is a situation with timestamped TCP packets where FreeBSD will send DATA+FIN and A W2K box will ack just the DATA portion. If this occurs after FreeBSD has done a (NewReno) fast-retransmit and is recovering it (dupacks > threshold) it triggers a case in tcp_newreno_partial_ack() (tcp_newreno() in stable) where tcp_output() is called with the expectation that the retransmit timer will be reloaded. But tcp_output() falls through and returns without doing anything, causing the persist timer to be loaded instead. This causes the connection to hang until W2K gives up. This occurs because in the case where only the FIN must be acked, the 'len' calculation in tcp_output() will be 0, a lot of checks will be skipped, and the FIN check will also be skipped because it is designed to handle FIN retransmits, not forced transmits from tcp_newreno(). The solution is to simply set TF_ACKNOW before calling tcp_output() to absolute guarentee that it will run the send code and reset the retransmit timer. TF_ACKNOW is already used for this purpose in other cases. For some unknown reason this patch also seems to greatly reduce the number of duplicate acks received when Guido runs his tests over a lossy network. It is quite possible that there are other tcp_newreno{_partial_ack()} cases which were not generating the expected output which this patch also fixes. X-MFC after: Will be MFC'd after the freeze is over
# 103776	22-Sep-2002	silby	Fix issue where shutdown(socket, SHUT_RD) was effectively ignored for TCP sockets. NetBSD PR: 18185 Submitted by: Sean Boudreau <seanb@qnx.com> MFC after: 3 days
# 103505	17-Sep-2002	dillon	Guido reported an interesting bug where an FTP connection between a Windows 2000 box and a FreeBSD box could stall. The problem turned out to be a timestamp reply bug in the W2K TCP stack. FreeBSD sends a timestamp with the SYN, W2K returns a timestamp of 0 in the SYN+ACK causing FreeBSD to calculate an insane SRTT and RTT, resulting in a maximal retransmit timeout (60 seconds). If there is any packet loss on the connection for the first six or so packets the retransmit case may be hit (the window will still be too small for fast-retransmit), causing a 60+ second pause. The W2K box gives up and closes the connection. This commit works around the W2K bug. 15:04:59.374588 FREEBSD.20 > W2K.1036: S 1420807004:1420807004(0) win 65535 <mss 1460,nop,wscale 2,nop,nop,timestamp 188297344 0> (DF) [tos 0x8] 15:04:59.377558 W2K.1036 > FREEBSD.20: S 4134611565:4134611565(0) ack 1420807005 win 17520 <mss 1460,nop,wscale 0,nop,nop,timestamp 0 0> (DF) Bug reported by: Guido van Rooij <guido@gvr.org>
# 102412	25-Aug-2002	charnier	Replace various spelling with FALLTHROUGH which is lint()able
# 102131	19-Aug-2002	jmallett	Enclose IPv6 addresses in brackets when they are displayed printable with a TCP/UDP port seperated by a colon. This is for the log_in_vain facility. Pointed out by: Edward J. M. Brocklesby Reviewed by: ume MFC after: 2 weeks
# 102017	17-Aug-2002	dillon	Implement TCP bandwidth delay product window limiting, similar to (but not meant to duplicate) TCP/Vegas. Add four sysctls and default the implementation to 'off'. net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off) net.inet.tcp.inflight_debug debugging (defaults to 1=on) net.inet.tcp.inflight_min minimum window limit net.inet.tcp.inflight_max maximum window limit MFC after: 1 week
# 102002	17-Aug-2002	hsu	Cosmetic-only changes for readability. Reviewed by: (early form passed by) bde Approved by: itojun (from core@kame.net)
# 101934	15-Aug-2002	rwatson	Rename mac_check_socket_receive() to mac_check_socket_deliver() so that we can use the names _receive() and _send() for the receive() and send() checks. Rename related constants, policy implementations, etc. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 101928	15-Aug-2002	hsu	Reset dupack count in header prediction. Follow-on to rev 1.39. Reviewed by: jayanth, Thomas R Henderson <thomas.r.henderson@boeing.com>, silby, dillon
# 101106	31-Jul-2002	rwatson	Introduce support for Mandatory Access Control and extensible kernel access control. Instrument the TCP socket code for packet generation and delivery: label outgoing mbufs with the label of the socket, and check socket and mbuf labels before permitting delivery to a socket. Assign labels to newly accepted connections when the syncache/cookie code has done its business. Also set peer labels as convenient. Currently, MAC policies cannot influence the PCB matching algorithm, so cannot implement polyinstantiation. Note that there is at least one case where a PCB is not available due to the TCP packet not being associated with any socket, so we don't label in that case, but need to handle it in a special manner. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 100534	22-Jul-2002	ru	Don't shrink socket buffers in tcp_mss(), application might have already configured them with setsockopt(SO_*BUF), for RFC1323's scaled windows. PR: kern/11966 MFC after: 1 week
# 100373	19-Jul-2002	dillon	Add the tcps_sndrexmitbad statistic, keep track of late acks that caused unnecessary retransmissions.
# 98781	24-Jun-2002	hsu	Avoid unlocking the inp twice if badport_bandlim() returns -1. Reported by: jlemon
# 98769	24-Jun-2002	hsu	Style bug: fix 4 space indentations that should have been tabs. Submitted by: jlemon
# 98703	23-Jun-2002	luigi	Move two global variables to automatic variables within the only function where they are used (they are used with TCPDEBUG only).
# 98613	22-Jun-2002	luigi	Remove (almost all) global variables that were used to hold packet forwarding state ("annotations") during ip processing. The code is considerably cleaner now. The variables removed by this change are: ip_divert_cookie used by divert sockets ip_fw_fwd_addr used for transparent ip redirection last_pkt used by dynamic pipes in dummynet Removal of the first two has been done by carrying the annotations into volatile structs prepended to the mbuf chains, and adding appropriate code to add/remove annotations in the routines which make use of them, i.e. ip_input(), ip_output(), tcp_input(), bdg_forward(), ether_demux(), ether_output_frame(), div_output(). On passing, remove a bug in divert handling of fragmented packet. Now it is the fragment at offset 0 which sets the divert status of the whole packet, whereas formerly it was the last incoming fragment to decide. Removal of last_pkt required a change in the interface of ip_fw_chk() and dummynet_io(). On passing, use the same mechanism for dummynet annotations and for divert/forward annotations. option IPFIREWALL_FORWARD is effectively useless, the code to implement it is very small and is now in by default to avoid the obfuscation of conditionally compiled code. NOTES: * there is at least one global variable left, sro_fwd, in ip_output(). I am not sure if/how this can be removed. * I have deliberately avoided gratuitous style changes in this commit to avoid cluttering the diffs. Minor stule cleanup will likely be necessary * this commit only focused on the IP layer. I am sure there is a number of global variables used in the TCP and maybe UDP stack. * despite the number of files touched, there are absolutely no API's or data structures changed by this commit (except the interfaces of ip_fw_chk() and dummynet_io(), which are internal anyways), so an MFC is quite safe and unintrusive (and desirable, given the improved readability of the code). MFC after: 10 days
# 98385	18-Jun-2002	tanimura	Remove so*_locked(), which were backed out by mistake.
# 98102	10-Jun-2002	hsu	Lock up inpcb. Submitted by: Jennifer Yang <yangjihui@yahoo.com>
# 97658	31-May-2002	tanimura	Back out my lats commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
# 96972	20-May-2002	tanimura	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# 95883	01-May-2002	alfred	Redo the sigio locking. Turn the sigio sx into a mutex. Sigio lock is really only needed to protect interrupts from dereferencing the sigio pointer in an object when the sigio itself is being destroyed. In order to do this in the most unintrusive manner change pgsigio's sigio * argument into a **, that way we can lock internally to the function.
# 95759	30-Apr-2002	tanimura	Revert the change of #includes in sys/filedesc.h and sys/socketvar.h. Requested by: bde Since locking sigio_lock is usually followed by calling pgsigio(), move the declaration of sigio_lock and the definitions of SIGIO_*() to sys/signalvar.h. While I am here, sort include files alphabetically, where possible.
# 95552	27-Apr-2002	tanimura	Add a global sx sigio_lock to protect the pointer to the sigio object of a socket. This avoids lock order reversal caused by locking a process in pgsigio(). sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now require sigio_lock to be locked. Provide sowwakeup_locked(), soisconnected_locked(), and so on in case where we have to modify a socket and wake up a process atomically.
# 95023	19-Apr-2002	suz	just merged cosmetic changes from KAME to ease sync between KAME and FreeBSD. (based on freebsd4-snap-20020128) Reviewed by: ume MFC after: 1 week
# 94390	10-Apr-2002	silby	Remove some ISN generation code which has been unused since the syncache went in. MFC after: 3 days
# 93085	24-Mar-2002	bde	Fixed some style bugs in the removal of __P(()). Continuation lines were not outdented to preserve non-KNF lining up of code with parentheses. Switch to KNF formatting.
# 92723	19-Mar-2002	alfred	Remove __P.
# 91374	27-Feb-2002	cjc	Change the wording of the inline comments from the previous commit. Objection from: ru
# 91234	25-Feb-2002	cjc	The TCP code did not do sufficient checks on whether incoming packets were destined for a broadcast IP address. All TCP packets with a broadcast destination must be ignored. The system only ignored packets that were _link-layer_ broadcasts or multicast. We need to check the IP address too since it is quite possible for a broadcast IP address to come in with a unicast link-layer address. Note that the check existed prior to CSRG revision 7.35, but was removed. This commit effectively backs out that nine-year-old change. PR: misc/35022
# 90868	18-Feb-2002	mike	o Move NTOHL() and associated macros into <sys/param.h>. These are deprecated in favor of the POSIX-defined lowercase variants. o Change all occurrences of NTOHL() and associated marcros in the source tree to use the lowercase function variants. o Add missing license bits to sparc64's <machine/endian.h>. Approved by: jake o Clean up <machine/endian.h> files. o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>. o Remove prototypes for non-existent bswapXX() functions. o Include <machine/endian.h> in <arpa/inet.h> to define the POSIX-required ntohl() family of functions. o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>, and <sys/param.h>. o Prepend underscores to the ntohl() family to help deal with complexities associated with having MD (asm and inline) versions, and having to prevent exposure of these functions in other headers that happen to make use of endian-specific defines. o Create weak aliases to the canonical function name to help deal with third-party software forgetting to include an appropriate header. o Remove some now unneeded pollution from <sys/types.h>. o Add missing <arpa/inet.h> includes in userland. Tested on: alpha, i386 Reviewed by: bde, jake, tmm
# 88884	04-Jan-2002	rwatson	o Spelling fix in comment: tcp_ouput -> tcp_output
# 87778	13-Dec-2001	jlemon	Fix up tabs in comments.
# 87193	02-Dec-2001	dillon	Fix a bug with transmitter restart after receiving a 0 window. The receiver was not sending an immediate ack with delayed acks turned on when the input buffer is drained, preventing the transmitter from restarting immediately. Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and is a good idea anyway). Some cleanup. Identify additonal issues in comments. MFC after: 1 day
# 86764	22-Nov-2001	jlemon	Introduce a syncache, which enables FreeBSD to withstand a SYN flood DoS in an improved fashion over the existing code. Reviewed by: silby (in a previous iteration) Sponsored by: DARPA, NAI Labs
# 86744	21-Nov-2001	jlemon	Move initialization of snd_recover into tcp_sendseqinit().
# 83366	12-Sep-2001	julian	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 82884	03-Sep-2001	julian	Patches from Keiichi SHIMA <keiichi@iij.ad.jp> to make ip use the standard protosw structure again. Obtained from: Well, KAME I guess.
# 82529	29-Aug-2001	jayanth	when newreno is turned on, if dupacks = 1 or dupacks = 2 and new data is acknowledged, reset the dupacks to 0. The problem was spotted when a connection had its send buffer full because the congestion window was only 1 MSS and was not being incremented because dupacks was not reset to 0. Obtained from: Yahoo!
# 82238	23-Aug-2001	dd	Correct a typo in a comment: FIN_WAIT2 -> FIN_WAIT_2 PR: 29970 Submitted by: Joseph Mallett <jmallett@xMach.org>
# 82122	22-Aug-2001	silby	Much delayed but now present: RFC 1948 style sequence numbers In order to ensure security and functionality, RFC 1948 style initial sequence number generation has been implemented. Barring any major crypographic breakthroughs, this algorithm should be unbreakable. In addition, the problems with TIME_WAIT recycling which affect our currently used algorithm are not present. Reviewed by: jesper
# 79413	08-Jul-2001	silby	Temporary feature: Runtime tuneable tcp initial sequence number generation scheme. Users may now select between the currently used OpenBSD algorithm and the older random positive increment method. While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT handling; this is causing trouble for an increasing number of folks. To switch between generation schemes, one sets the sysctl net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments, 1 = the OpenBSD algorithm. 1 is still the default. Once a secure _and_ compatible algorithm is implemented, this sysctl will be removed. Reviewed by: jlemon Tested by: numerous subscribers of -net
# 78667	23-Jun-2001	ru	Add netstat(1) knob to reset net.inet.{ip\|icmp\|tcp\|udp\|igmp}.stats. For example, ``netstat -s -p ip -z'' will show and reset IP stats. PR: bin/17338
# 78642	23-Jun-2001	silby	Eliminate the allocation of a tcp template structure for each connection. The information contained in a tcptemp can be reconstructed from a tcpcb when needed. Previously, tcp templates required the allocation of one mbuf per connection. On large systems, this change should free up a large number of mbufs. Reviewed by: bmilekic, jlemon, ru MFC after: 2 weeks
# 78064	11-Jun-2001	ume	Sync with recent KAME. This work was based on kame-20010528-freebsd43-snap.tgz and some critical problem after the snap was out were fixed. There are many many changes since last KAME merge. TODO: - The definitions of SADB_* in sys/net/pfkeyv2.h are still different from RFC2407/IANA assignment because of binary compatibility issue. It should be fixed under 5-CURRENT. - ip6po_m member of struct ip6_pktopts is no longer used. But, it is still there because of binary compatibility issue. It should be removed under 5-CURRENT. Reviewed by: itojun Obtained from: KAME MFC after: 3 weeks
# 77830	06-Jun-2001	jesper	Silby's take one on increasing FreeBSD's resistance to SYN floods: One way we can reduce the amount of traffic we send in response to a SYN flood is to eliminate the RST we send when removing a connection from the listen queue. Since we are being flooded, we can assume that the majority of connections in the queue are bogus. Our RST is unwanted by these hosts, just as our SYN-ACK was. Genuine connection attempts will result in hosts responding to our SYN-ACK with an ACK packet. We will automatically return a RST response to their ACK when it gets to us if the connection has been dropped, so the early RST doesn't serve the genuine class of connections much. In summary, we can reduce the number of packets we send by a factor of two without any loss in functionality by ensuring that RST packets are not sent when dropping a connection from the listen queue. Submitted by: Mike Silbersack <silby@silby.com> Reviewed by: jesper MFC after: 2 weeks
# 77427	29-May-2001	jesper	Inline TCP_REASS() in the single location where it's used, just as OpenBSD and NetBSD has done. No functional difference. MFC after: 2 weeks
# 77421	29-May-2001	jesper	properly delay acks in half-closed TCP connections PR: 24962 Submitted by: Tony Finch <dot@dotat.at> MFC after: 2 weeks
# 75733	20-Apr-2001	jesper	Say goodbye to TCP_COMPAT_42 Reviewed by: wollman Requested by: wollman
# 75619	17-Apr-2001	kris	Randomize the TCP initial sequence numbers more thoroughly. Obtained from: OpenBSD Reviewed by: jesper, peter, -developers
# 74494	19-Mar-2001	des	Axe TCP_RESTRICT_RST. It was never a particularly good idea except for a few very specific scenarios, and now that we have had net.inet.tcp.blackhole for quite some time there is really no reason to use it any more. (last of three commits)
# 73031	25-Feb-2001	jlemon	Do not delay a new ack if there already is a delayed ack pending on the connection, but send it immediately. Prior to this change, it was possible to delay a delayed-ack for multiple times, resulting in degraded TCP behavior in certain corner cases.
# 72357	11-Feb-2001	bmilekic	Clean up RST ratelimiting. Previously, ratelimiting occured before tests were performed to determine if the received packet should be reset. This created erroneous ratelimiting and false alarms in some cases. The code has now been reorganized so that the checks for validity come before the call to badport_bandlim. Additionally, a few changes in the symbolic names of the bandlim types have been made, as well as a clarification of exactly which type each RST case falls under. Submitted by: Mike Silbersack <silby@silby.com>
# 71594	24-Jan-2001	wollman	Correct a comment.
# 70070	15-Dec-2000	bmilekic	Change the following: 1. ICMP ECHO and TSTAMP replies are now rate limited. 2. RSTs generated due to packets sent to open and unopen ports are now limited by seperate counters. 3. Each rate limiting queue now has its own description, as follows: Limiting icmp unreach response from 439 to 200 packets per second Limiting closed port RST response from 283 to 200 packets per second Limiting open port RST response from 18724 to 200 packets per second Limiting icmp ping response from 211 to 200 packets per second Limiting icmp tstamp response from 394 to 200 packets per second Submitted by: Mike Silbersack <silby@silby.com>
# 69781	08-Dec-2000	dwmalone	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
# 68318	04-Nov-2000	jlemon	tp->snd_recover is part of the New Reno recovery algorithm, and should only be checked if the system is currently performing New Reno style fast recovery. However, this value was being checked regardless of the NR state, with the end result being that the congestion window was never opened. Change the logic to check t_dupack instead; the only code path that allows it to be nonzero at this point is NewReno, so if it is nonzero, we are in fast recovery mode and should not touch the congestion window. Tested by: phk
# 63745	21-Jul-2000	jayanth	When a connection is being dropped due to a listen queue overflow, delete the cloned route that is associated with the connection. This does not exhaust the routing table memory when the system is under a SYN flood attack. The route entry is not deleted if there is any prior information cached in it. Reviewed by: Peter Wemm,asmodai
# 62846	09-Jul-2000	itojun	be more cautious about tcp option length field. drop bogus ones earlier. not sure if there is a real threat or not, but it seems that there's possibility for overrun/underrun (like non-NOP option with optlen > cnt).
# 62587	04-Jul-2000	itojun	sync with kame tree as of july00. tons of bug fixes/improvements. API changes: - additional IPv6 ioctls - IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8). (also syntax change)
# 60798	22-May-2000	dan	sysctl'ize ICMP_BANDLIM and ICMP_BANDLIM_SUPPRESS_OUTPUT. Suggested by: des/nbm
# 60687	18-May-2000	jayanth	snd_cwnd was updated twice in the tcp_newreno function.
# 60662	17-May-2000	jayanth	Sigh, fix a rookie patch merge error. Also-missed-by: peter
# 60619	16-May-2000	jayanth	snd_una was being updated incorrectly, this resulted in the newreno code retransmitting data from the wrong offset. As a footnote, the newreno code was partially derived from NetBSD and Tom Henderson <tomh@cs.berkeley.edu>
# 60067	06-May-2000	jlemon	Implement TCP NewReno, as documented in RFC 2582. This allows better recovery for multiple packet losses in a single window. The algorithm can be toggled via the sysctl net.inet.tcp.newreno, which defaults to "on". Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
# 59334	17-Apr-2000	sumikawa	ND6_HINT() should not be called unless the connection status is ESTABLISHED. Obtained from: KAME Project
# 58907	01-Apr-2000	shin	Support per socket based IPv4 mapped IPv6 addr enable/disable control. Submitted by: ume
# 58698	27-Mar-2000	jlemon	Add support for offloading IP/TCP/UDP checksums to NIC hardware which supports them.
# 57903	11-Mar-2000	shin	IPv6 6to4 support. Now most big problem of IPv6 is getting IPv6 address assignment. 6to4 solve the problem. 6to4 addr is defined like below, 2002: 4byte v4 addr : 2byte SLA ID : 8byte interface ID The most important point of the address format is that an IPv4 addr is embeded in it. So any user who has IPv4 addr can get IPv6 address block with 2byte subnet space. Also, the IPv4 addr is used for semi-automatic IPv6 over IPv4 tunneling. With 6to4, getting IPv6 addr become dramatically easy. The attached patch enable 6to4 extension, and confirmed to work, between "Richard Seaman, Jr." <dick@tar.com> and me. Approved by: jkh Reviewed by: itojun
# 56724	28-Jan-2000	imp	Mitigate the stream.c attacks o Drop all broadcast and multicast source addresses in tcp_input. o Enable ICMP_BANDLIM in GENERIC. o Change default to 200/s from 100/s. This will still stop the attack, but is conservative enough to do this close to code freeze. This is not the optimal patch for the problem, but is likely the least intrusive patch that can be made for this. Obtained from: Don Lewis and Matt Dillon. Reviewed by: freebsd-security
# 56565	25-Jan-2000	shin	Avoid m_len and m_pkthdr.len inconsistency when changing m_len for an mbuf whose M_PKTHDR is set. PR: related to kern/15175 Reviewed by: archie
# 56041	15-Jan-2000	shin	Fixed the problem that IPsec connection hangs when bigger data is sent. -opt_ipsec.h was missing on some tcp files (sorry for basic mistake) -made buildable as above fix -also added some missing IPv4 mapped IPv6 addr consideration into ipsec4_getpolicybysock
# 55875	13-Jan-2000	shin	add a comment for some possible? IPv4 option processing.
# 55679	09-Jan-2000	shin	tcp updates to support IPv6. also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 55009	22-Dec-1999	shin	IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 54601	14-Dec-1999	jlemon	Use SEQ_* macros for comparing sequence space numbers. Reviewed by: truckman
# 54421	11-Dec-1999	jlemon	According to RFC 793, a reset should be honored if the sequence number is within the receive window. Follow this behavior, instead of only allowing resets at last_ack_sent. Pointed out by: jayanth@yahoo-inc.com
# 54263	07-Dec-1999	shin	udp IPv6 support, IPv6/IPv4 tunneling support in kernel, packet divert at kernel for IPv6/IPv4 translater daemon This includes queue related patch submitted by jburkhol@home.com. Submitted by: queue related patch from jburkhol@home.com Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 52070	09-Oct-1999	green	Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total usage limit.
# 51279	14-Sep-1999	des	Fix some more disordering, as well as the description string for the net.inet.tcp.drop_synfin sysctl, which for some mysterious reason said "Drop TCP packets with FIN+ACK set" (instead of "...with SYN+FIN set")
# 51209	12-Sep-1999	des	Add the net.inet.tcp.restrict_rst and net.inet.tcp.drop_synfin sysctl variables, conditional on the TCP_RESTRICT_RST and TCP_DROP_SYNFIN kernel options, respectively. See the comments in LINT for details.
# 50673	30-Aug-1999	jlemon	Restructure TCP timeout handling: - eliminate the fast/slow timeout lists for TCP and instead use a callout entry for each timer. - increase the TCP timer granularity to HZ - implement "bad retransmit" recovery, as presented in "On Estimating End-to-End Network Path Properties", by Allman and Paxson. Submitted by: jlemon, wollmann
# 50596	29-Aug-1999	obrien	Remove extra indenting of `break' statements introducted in rev 1.89, plus wrap some long lines from that revision. While here, wrap some other long lines.
# 50477	28-Aug-1999	peter	$Id$ -> $FreeBSD$
# 50043	19-Aug-1999	csgr	Fix breakage if blackhole=1 and tiflags & TH_SYN, plus style(9) fixes Submitted by: Jonathon Lemon
# 50015	18-Aug-1999	csgr	Slight tweak to tcp.blackhole to add optional behaviour to drop any segment arriving at a closed port. tcp.blackhole=1 - only drop SYN without RST tcp.blackhole=2 - drop everything without RST tcp.blackhole=0 - always send RST - default behaviour This confuses nmap -sF or -sX or -sN quite badly.
# 49968	17-Aug-1999	csgr	Add net.inet.tcp.blackhole and net.inet.udp.blackhole sysctl knobs. With these knobs on, refused connection attempts are dropped without sending a RST, or Port unreachable in the UDP case. In the TCP case, sending of RST is inhibited iff the incoming segment was a SYN. Docs and rc.conf settings to follow.
# 48886	18-Jul-1999	jmb	fix comment re: RST received in TIME_WAIT to match the code.
# 46568	06-May-1999	peter	Add sufficient braces to keep egcs happy about potentially ambiguous if/else nesting.
# 46381	03-May-1999	billf	Add sysctl descriptions to many SYSCTL_XXXs PR: kern/11197 Submitted by: Adrian Chadd <adrian@FreeBSD.org> Reviewed by: billf(spelling/style/minor nits) Looked at by: bde(style)
# 43691	06-Feb-1999	fenner	Use snd_nxt, not rcv_nxt, when calculating the ISS during TIME_WAIT. This was missed in the 4.4-Lite2 merge. Noticed by: Mohan Parthasarathy <Mohan.Parthasarathy@eng.Sun.COM> and jayanth@loc201.tandem.com (vijayaraghavan_jayanth) on the tcp-impl mailing list.
# 43305	27-Jan-1999	dillon	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 41487	03-Dec-1998	dillon	Reviewed by: freebsd-current Add ICMP_BANDLIM option and 'net.inet.icmp.icmplim' sysctl. If option is specified in kernel config, icmplim defaults to 100 pps. Setting it to 0 will disable the feature. This feature limits ICMP error responses for packets sent to bad tcp or udp ports, which does a lot to help the machine handle network D.O.S. attacks. The kernel will report packet rates that exceed the limit at a rate of one kernel printf per second. There is one issue in regards to the 'tail end' of an attack... the kernel will not output the last report until some unrelated and valid icmp error packet is return at some point after the attack is over. This is a minor reporting issue only.
# 39078	11-Sep-1998	wollman	Fix RST validation. PR: 7892 Submitted by: Don.Lewis@tsc.tdk.com
# 38513	24-Aug-1998	dfr	Re-implement tcp and ip fragment reassembly to not store pointers in the ip header which can't work on alpha since pointers are too big. Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
# 37409	06-Jul-1998	julian	Support for IPFW based transparent forwarding. Any packet that can be matched by a ipfw rule can be redirected transparently to another port or machine. Redirection to another port mostly makes sense with tcp, where a session can be set up between a proxy and an unsuspecting client. Redirection to another machine requires that the other machine also be expecting to receive the forwarded packets, as their headers will not have been modified. /sbin/ipfw must be recompiled!!! Reviewed by: Peter Wemm <peter@freebsd.org> Submitted by: Chrisy Luke <chrisy@flix.net>
# 36529	31-May-1998	peter	Let the sowwakeup macro decide when to call sowakeup rather than have tcp "know" about it. A pending upcall would be missed, eg: used by NFS. Obtained from: NetBSD
# 36161	18-May-1998	guido	Grumble...It seems I'm suffering from some mental disease. Do it correct now.
# 36159	18-May-1998	guido	Add some parenthesis for clarity and fix a bug Pointed out by: Garrett Wollmand
# 35698	04-May-1998	guido	Refuse accellerated opens on listening sockets that have not set the TCP_NOPUSH socket option. This disables TAO for those services that do not know about T/TCP. Reviewed by: Garrett Wollman Submitted by: Peter Wemm
# 35421	24-Apr-1998	dg	At the request of Garrett, changed sysctl: net.inet.tcp.delack_enabled -> net.inet.tcp.delayed_ack
# 35256	17-Apr-1998	des	Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.
# 35056	06-Apr-1998	phk	Remove the last traces of TUBA. Inspired by: PR kern/3317
# 34697	20-Mar-1998	fenner	Remove the check for SYN in SYN_RECEIVED state; it breaks simultaneous connect. This check was added as part of the defense against the "land" attack, to prevent attacks which guess the ISS from going into ESTABLISHED. The "src == dst" check will still prevent the single-homed case of the "land" attack, and guessing ISS's should be hard anyway. Submitted by: David Borman <dab@bsdi.com>
# 33846	26-Feb-1998	dg	Changes to support the addition of a new sysctl variable: net.inet.tcp.delack_enabled Which defaults to 1 and can be set to 0 to disable TCP delayed-ack processing (i.e. all acks are immediate).
# 32821	27-Jan-1998	dg	Improved connection establishment performance by doing local port lookups via a hashed port list. In the new scheme, in_pcblookup() goes away and is replaced by a new routine, in_pcblookup_local() for doing the local port check. Note that this implementation is space inefficient in that the PCB struct is now too large to fit into 128 bytes. I might deal with this in the future by using the new zone allocator, but I wanted these changes to be extensively tested in their current form first. Also: 1) Fixed off-by-one errors in the port lookup loops in in_pcbbind(). 2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash() to do the initialial hash insertion. 3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability. 4) Added a new routine, in_pcbremlists() to remove the PCB from the various hash lists. 5) Added/deleted comments where appropriate. 6) Removed unnecessary splnet() locking. In general, the PCB functions should be called at splnet()...there are unfortunately a few exceptions, however. 7) Reorganized a few structs for better cache line behavior. 8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in the future, however. These changes have been tested on wcarchive for more than a month. In tests done here, connection establishment overhead is reduced by more than 50 times, thus getting rid of one of the major networking scalability problems. Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult. WARNING: Anything that knows about inpcb and tcpcb structs will have to be recompiled; at the very least, this includes netstat(1).
# 32662	21-Jan-1998	fenner	A more complete fix for the "land" attack, removing the "quick fix" from rev 1.66. This fix contains both belt and suspenders. Belt: ignore packets where src == dst and srcport == dstport in TCPS_LISTEN. These packets can only legitimately occur when connecting a socket to itself, which doesn't go through TCPS_LISTEN (it goes CLOSED->SYN_SENT->SYN_RCVD-> ESTABLISHED). This prevents the "standard" "land" attack, although doesn't prevent the multi-homed variation. Suspenders: send a RST in response to a SYN/ACK in SYN_RECEIVED state. The only packets we should get in SYN_RECEIVED are 1. A retransmitted SYN, or 2. An ack of our SYN/ACK. The "land" attack depends on us accepting our own SYN/ACK as an ACK; in SYN_RECEIVED state; this should prevent all "land" attacks. We also move up the sequence number check for the ACK in SYN_RECEIVED. This neither helps nor hurts with respect to the "land" attack, but puts more of the validation checking in one spot. PR: kern/5103
# 31882	19-Dec-1997	bde	Don't use ANSI string concatenation to misformat a string.
# 31323	20-Nov-1997	wollman	Add Matt Dillon's quick fix hack for the self-connect DoS. PR: 5103
# 31016	07-Nov-1997	phk	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused
# 30813	28-Oct-1997	bde	Removed unused #includes.
# 30052	02-Oct-1997	dg	Killed the SYN_RECEIVED addition from rev 1.52. It results in legitimate RST's being ignored, keeping a connection around until it times out, and thus has the opposite effect of what was intended (which is to make the system more robust to DoS attacks).
# 30005	30-Sep-1997	fenner	Don't consider a SYN/ACK with CC but no CCECHO a proper T/TCP handshake. Reviewed by: Rich Stevens <rstevens@kohala.com>
# 29514	16-Sep-1997	joerg	Make TCPDEBUG a new-style option.
# 28270	16-Aug-1997	wollman	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to not malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# 27135	01-Jul-1997	jdp	Fix a bug (apparently very old) that can cause a TCP connection to be dropped when it has an unusual traffic pattern. For full details as well as a test case that demonstrates the failure, see the referenced PR. Under certain circumstances involving the persist state, it is possible for the receive side's tp->rcv_nxt to advance beyond its tp->rcv_adv. This causes (tp->rcv_adv - tp->rcv_nxt) to become negative. However, in the code affected by this fix, that difference was interpreted as an unsigned number by max(). Since it was negative, it was taken as a huge unsigned number. The effect was to cause the receiver to believe that its receive window had negative size, thereby rejecting all received segments including ACKs. As the test case shows, this led to fruitless retransmissions and eventually to a dropped connection. Even connections using the loopback interface could be dropped. The fix substitutes the signed imax() for the unsigned max() function. PR: closes kern/3998 Reviewed by: davidg, fenner, wollman
# 25201	27-Apr-1997	wollman	The long-awaited mega-massive-network-code- cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so they they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
# 22975	22-Feb-1997	peter	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 21673	14-Jan-1997	jkh	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 19597	10-Nov-1996	fenner	Re-enable the TCP SYN-attack protection code. I was the one who didn't understand the socket state flag. 2.2 candidate.
# 18874	11-Oct-1996	pst	Fix two bugs I accidently put into the syn code at the last minute (yes I had tested the hell out of this). I've also temporarily disabled the code so that it behaves as it previously did (tail drop's the syns) pending discussion with fenner about some socket state flags that I don't fully understand. Submitted by: fenner
# 18795	07-Oct-1996	dg	Improved in_pcblookuphash() to support wildcarding, and changed relavent callers of it to take advantage of this. This reduces new connection request overhead in the face of a large number of PCBs in the system. Thanks to David Filo <filo@yahoo.com> for suggesting this and providing a sample implementation (which wasn't used, but showed that it could be done). Reviewed by: wollman
# 18787	07-Oct-1996	pst	Increase robustness of FreeBSD against high-rate connection attempt denial of service attacks. Reviewed by: bde,wollman,olah Inspired by: vjs@sgi.com
# 18437	21-Sep-1996	pst	I don't understand, I committed this fix (move a counter and fixed a typo) this evening. I think I'm going insane.
# 18436	21-Sep-1996	ache	Syntax error: so_incom -> so_incomp
# 18431	20-Sep-1996	pst	If the incomplete listen queue for a given socket is full, drop the oldest entry in the queue. There was a fair bit of discussion as to whether or not the proper action is to drop a random entry in the queue. It's my conclusion that a random drop is better than a head drop, however profiling this section of code (done by John Capo) shows that a head-drop results in a significant performance increase. There are scenarios where a random drop is more appropriate. If I find one in reality, I'll add the random drop code under a conditional. Obtained from: discussions and code done by Vernon Schryver (vjs@sgi.com).
# 18280	13-Sep-1996	pst	Make the misnamed tcp initial keepalive timer value (which is really the time, in seconds, that state for non-established TCP sessions stays about) a sysctl modifyable variable. [part 1 of two commits, I just realized I can't play with the indices as I was typing this commit message.]
# 18278	13-Sep-1996	pst	Receipt of two SYN's are sufficient to set the t_timer[TCPT_KEEP] to "keepidle". this should not occur unless the connection has been established via the 3-way handshake which requires an ACK Submitted by: jmb Obtained from: problem discussed in Stevens vol. 3
# 15525	02-May-1996	fenner	Back out my stupid braino; I was thinking strlen and not sizeof.
# 15524	02-May-1996	fenner	Size temp var correctly; buf[4*sizeof "123"] is not long enough to store "192.252.119.189\0".
# 15414	27-Apr-1996	ache	inet_ntoa buffer was evaluated twice in log_in_vain, fix it. Thanx to: jdp
# 15396	26-Apr-1996	wollman	Delete #ifdef notdef blocks containing old method of srtt calculation. Requested by: davidg
# 15154	09-Apr-1996	pst	Logging UDP and TCP connection attempts should not be enabled by default. It's trivial to create a denial of service attack on a box so enabled. These messages, if enabled at all, must be rate-limited. (!)
# 15038	04-Apr-1996	phk	Log TCP syn packets for ports we don't listen on. Controlled by: sysctl net.inet.tcp.log_in_vain: 1 Log UDP syn packets for ports we don't listen on. Controlled by: sysctl net.inet.udp.log_in_vain: 1 Suggested by: Warren Toomey <wkt@cs.adfa.oz.au>
# 14819	25-Mar-1996	wollman	Slight modification of RTO floor calculation.
# 14753	22-Mar-1996	wollman	A number of performance-reducing flaws fixed based on comments from Larry Peterson &co. at Arizona: - Header prediction for ACKs did not exclude Fast Retransmit/Recovery. - srtt calculation tended to get ``stuck'' and could never decrease when below 8. It still can't, but the scaling factors are adjusted so that this artifact does not cause as bad an effect on the RTO value as it used to. The paper also points out the incr/8 error that has been long since fixed, and the problems with ACKing frequency resulting from the use of options which I suspect to be fixed already as well (as part of the T/TCP work). Obtained from: Brakmo & Peterson, ``Performance Problems in BSD4.4 TCP''
# 14546	11-Mar-1996	dg	Move or add #include <queue.h> in preparation for upcoming struct socket changes.
# 14268	26-Feb-1996	guido	Add a counter for the number of times the listen queue was overflowed to the tcpstat structure. (netstat -s) Reviewed by: wollman Obtained from: Steves, TCP/IP Ill. vol.3, page 189
# 14181	22-Feb-1996	dg	Fixed bug in Path MTU Discovery that caused the system to have to re- discover the Path MTU for each connection if the connecting host didn't offer an initial MSS. Submitted by: davidg & olah
# 13779	31-Jan-1996	olah	Fix a bug related to the interworking of T/TCP and window scaling: when a connection enters the ESTBLS state using T/TCP, then window scaling wasn't properly handled. The fix is twofold. 1) When the 3WHS completes, make sure that we update our window scaling state variables. 2) When setting the `virtual advertized window', then make sure that we do not try to offer a window that is larger than the maximum window without scaling (TCP_MAXWIN). Reviewed by: davidg Reported by: Jerry Chen <chen@Ipsilon.COM>
# 12820	14-Dec-1995	phk	Another mega commit to staticize things.
# 12296	14-Nov-1995	phk	New style sysctl & staticize alot of stuff.
# 12172	09-Nov-1995	phk	Start adding new style sysctl here too.
# 12047	03-Nov-1995	olah	Cosmetic changes to processing of segments in the SYN_SENT state: - remove a redundant condition; - complete all validity checks on segment before calling soisconnected(so). Reviewed by: Richard Stevens, davidg, wollman
# 11458	13-Oct-1995	wollman	Routes can be asymmetric. Always offer to /accept/ an MSS of up to the capacity of the link, even if the route's MTU indicates that we cannot send that much in their direction. (This might actually make it possible to test Path MTU discovery in a useful variety of cases.)
# 11150	03-Oct-1995	wollman	Finish 4.4-Lite-2 merge: randomize TCP initial sequence numbers to make ISS-guessing spoofing attacks harder.
# 9818	31-Jul-1995	olah	Remove a redundant `if' from tcp_reass(). Correct a typo in a comment (SEND_SYN -> NEEDSYN). Reviewed by: David Greenman
# 9470	10-Jul-1995	wollman	tcp_input.c - keep track of how many times a route contained a cached rtt or ssthresh that we were able to use tcp_var.h - declare tcpstat entries for above; declare tcp_{send,recv}space in_rmx.c - fill in the MTU and pipe sizes with the defaults TCP would have used anyway in the absence of values here
# 9373	29-Jun-1995	wollman	Keep track of the number of samples through the srtt filter so that we know better when to cache values in the route, rather than relying on a heuristic involving sequence numbers that broke when tcp_sendspace was increased to 16k.
# 8876	30-May-1995	rgrimes	Remove trailing whitespace.
# 8429	11-May-1995	dg	#ifdef'd my Nagel/ACK hack with "TCP_ACK_HACK", disabled by default. I'm currently considering reducing the TCP fasttimo to 100ms to help improve things, but this would be done as a seperate step at some point in the future. This was done because it was causing some sometimes serious performance problems with T/TCP.
# 8377	09-May-1995	olah	Fix a misspelled constant in tcp_input.c. On Tue, 09 May 1995 04:35:27 PDT, Richard Stevens wrote: > In tcp_dooptions() under the case TCPOPT_CC there is an assignment > > to->to_flag \|= TCPOPT_CC; > > that should be > > to->to_flag \|= TOF_CC; > > I haven't thought through the ramifications of what's been happening ... > > Rich Stevens Submitted by: rstevens@noao.edu (Richard Stevens)
# 8235	03-May-1995	dg	Changed in_pcblookuphash() to not automatically call in_pcblookup() if the lookup fails. Updated callers to deal with this. Call in_pcblookuphash instead of in_pcblookup() in in_pcbconnect; this improves performance of UDP output by about 17% in the standard case.
# 7738	10-Apr-1995	dg	Further satisfy my paranoia by making sure that the ACKNOW is only set when ti_len is non-zero.
# 7737	10-Apr-1995	dg	Fixed bug I introduced with my Nagel hack which caused tcp_input and tcp_output to loop endlessly. This was freefall's problem during the past day.
# 7684	09-Apr-1995	dg	Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash, and in_pcblookuphash.
# 7634	05-Apr-1995	olah	Fix a bug in tcp_input reported by Rick Jones <raj@hpisrdq.cup.hp.com>. If a goto findpcb occurred during the processing of a segment, the TCP and IP headers were dropped twice from the mbuf which resulted in data acked by TCP but not delivered to the user. Reviewed by: davidg
# 7417	27-Mar-1995	dg	Re-apply my "breakage" to the Nagel congestion avoidence. This version differs slightly in the logic from the previous version; packets are now acked immediately if the sender set PUSH.
# 7090	16-Mar-1995	bde	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 6480	16-Feb-1995	wollman	Avoid deadlock situation described by Stevens using his suggested replacement code. Obtained from: Stevens, vol. 2, pp. 959-960
# 6475	16-Feb-1995	wollman	Transaction TCP support now standard. Hack away!
# 6363	14-Feb-1995	phk	YFfix.
# 6348	14-Feb-1995	wollman	Get rid of some unneeded #ifdef TTCP lines. Also, get rid of some bogus commons declared in header files.
# 6283	09-Feb-1995	wollman	Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and Bob Braden <braden@isi.edu>. NB: This has not had David's TCP ACK hack re-integrated. It is not clear what the correct solution to this problem is, if any. If a better solution doesn't pop up in response to this message, I'll put David's code back in (or he's welcome to do so himself).
# 3561	13-Oct-1994	wollman	As suggested by Sally Floyd, don't add the ``small fraction of the window size'' when doing congestion avoidance. Submitted by: Mark Andrews
# 3311	02-Oct-1994	phk	GCC cleanup. Reviewed by: Submitted by: Obtained from:
# 2788	15-Sep-1994	dg	Made TCPDEBUG truely optional. Based on changes I made in FreeBSD 1.1.5. Fixed somebody's idea of a joke - about the first half of the lines in in_proto.c were spaced over by one space.
# 2304	26-Aug-1994	wollman	Obey RFC 793, section 3.4: Several examples of connection initiation follow. Although these examples do not show connection synchronization using data-carrying segments, this is perfectly legitimate, so long as the receiving TCP doesn't deliver the data to the user until it is clear the data is valid (i.e., the data must be buffered at the receiver until the connection reaches the ESTABLISHED state).
# 2112	18-Aug-1994	wollman	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
# 1817	02-Aug-1994	dg	Added $Id$
# 1812	01-Aug-1994	dg	Fixed bug with Nagel Congestion Avoidance where a tcp connection would stall unnecessarily - always send an ACK when a packet len of < mss is received.
# 1565	26-May-1994	dg	Added missing ntohl()'s that are needed before calling IN_MULTICAST in a couple of places. Submitted by: Johannes Helander
# 1549	25-May-1994	rgrimes	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# 1541	24-May-1994	rgrimes	BSD 4.4 Lite Kernel Sources