History log of /freebsd-10.3-release/sys/netinet/tcp_reass.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 296373 04-Mar-2016 marius

- Copy stable/10@296371 to releng/10.3 in preparation for 10.3-RC1
builds.
- Update newvers.sh to reflect RC1.
- Update __FreeBSD_version to reflect 10.3.
- Update default pkg(8) configuration to use the quarterly branch.

Approved by: re (implicit)

# 285976 28-Jul-2015 delphij

Fix patch(1) shell injection vulnerability. [SA-15:14]

Fix resource exhaustion in TCP reassembly. [SA-15:15]

Fix OpenSSH multiple vulnerabilities. [SA-15:16]


# 273736 27-Oct-2014 hselasky

MFC r263710, r273377, r273378, r273423 and r273455:

- De-vnet hash sizes and hash masks.
- Fix multiple issues related to arguments passed to SYSCTL macros.

Sponsored by: Mellanox Technologies


# 265122 30-Apr-2014 delphij

Fix devfs rules not applied by default for jails.

Fix OpenSSL use-after-free vulnerability.

Fix TCP reassembly vulnerability.

Security: FreeBSD-SA-14:07.devfs
Security: CVE-2014-3001
Security: FreeBSD-SA-14:08.tcp
Security: CVE-2014-3000
Security: FreeBSD-SA-14:09.openssl
Security: CVE-2010-5298


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 246208 01-Feb-2013 andre

uma_zone_set_max() directly returns the rounded effective zone
limit. Use the return value directly instead of doing a second
uma_zone_set_max() step.

MFC after: 1 week


# 244680 25-Dec-2012 glebius

Fix sysctl_handle_int() usage. Either arg1 or arg2 should be supplied,
and arg2 doesn't pass size of arg1.


# 242253 28-Oct-2012 andre

Simplify implementation of net.inet.tcp.reass.maxsegments and
net.inet.tcp.reass.cursegments.

MFC after: 2 weeks


# 228016 27-Nov-2011 lstewart

Plug a TCP reassembly UMA zone leak introduced in r226113 by only using the
backup stack queue entry when the zone is exhausted, otherwise we leak a zone
allocation each time we plug a hole in the reassembly queue.

Reported by: many on freebsd-stable@ (thread: "TCP Reassembly Issues")
Tested by: many on freebsd-stable@ (thread: "TCP Reassembly Issues")
Reviewed by: bz (very brief sanity check)
MFC after: 3 days


# 227309 07-Nov-2011 ed

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# 226113 07-Oct-2011 andre

Prevent TCP sessions from stalling indefinitely in reassembly
when reaching the zone limit of reassembly queue entries.

When the zone limit was reached not even the missing segment
that would complete the sequence space could be processed
preventing the TCP session forever from making any further
progress.

Solve this deadlock by using a temporary on-stack queue entry
for the missing segment followed by an immediate dequeue again
by delivering the contiguous sequence space to the socket.

Add logging under net.inet.tcp.log_debug for reassembly queue
issues.

Reviewed by: lsteward (previous version)
Tested by: Steven Hartland <killing-at-multiplay.co.uk>
MFC after: 3 days


# 217554 18-Jan-2011 mdf

Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need
to rely on the format string. For SYSCTL_PROC instances that I
noticed a discrepancy between the CTLTYPE and the format specifier,
fix the CTLTYPE.


# 217126 07-Jan-2011 jhb

Trim extra spaces before tabs.


# 215701 22-Nov-2010 dim

After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.


# 215434 17-Nov-2010 gnn

Add new, per connection, statistics for TCP, including:
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives

These statistics are available via the -T argument to
netstat(1).
MFC after: 2 weeks


# 215317 14-Nov-2010 dim

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.


# 213913 16-Oct-2010 lstewart

Retire the system-wide, per-reassembly queue segment limit. The mechanism is far
too coarse grained to be useful and the default value significantly degrades TCP
performance on moderate to high bandwidth-delay product paths with non-zero loss
(e.g. 5+Mbps connections across the public Internet often suffer).

Replace the outgoing mechanism with an individual per-queue limit based on the
number of MSS segments that fit into the socket's receive buffer. This should
strike a good balance between performance and the potential for resource
exhaustion when FreeBSD is acting as a TCP receiver. With socket buffer
autotuning (which is enabled by default), the reassembly queue tracks the
socket buffer and benefits too.

As the XXX comment suggests, my testing uncovered some unexpected behaviour
which requires further investigation. By using so->so_rcv.sb_hiwat
instead of sbspace(&so->so_rcv), we allow more segments to be held across both
the socket receive buffer and reassembly queue than we probably should. The
tradeoff is better performance in at least one common scenario, versus a devious
sender's ability to consume more resources on a FreeBSD receiver.

Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo
MFC after: 2 weeks


# 213912 16-Oct-2010 lstewart

- Switch the "net.inet.tcp.reass.cursegments" and
"net.inet.tcp.reass.maxsegments" sysctl variables to be based on UMA zone
stats. The value returned by the cursegments sysctl is approximate owing to
the way in which uma_zone_get_cur is implemented.

- Discontinue use of V_tcp_reass_qsize as a global reassembly segment count
variable in the reassembly implementation. The variable was used without
proper synchronisation and was duplicating accounting done by UMA already. The
lack of synchronisation was particularly problematic on SMP systems
terminating many TCP sessions, resulting in poor TCP performance for
connections with non-zero packet loss.

Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo (as part of a larger patch)
MFC after: 2 weeks


# 213158 25-Sep-2010 lstewart

Internalise reassembly queue related functionality and variables which should
not be used outside of the reassembly queue implementation. Provide a new
function to flush all segments from a reassembly queue and call it from the
appropriate places instead of manipulating the queue directly.

Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo
MFC after: 2 weeks


# 207369 29-Apr-2010 bz

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 204838 07-Mar-2010 bz

Destroy TCP UMA zones (empty or not) upon network stack teardown
to not leak them, otherwise making UMA/vmstat unhappy with every stoped vnet.
We will still leak pages (especially for zones marked NOFREE).

Reshuffle cleanup order in tcp_destroy() to get rid of what we can
easily free first.

Sponsored by: ISPsystem
Reviewed by: rwatson
MFC after: 5 days


# 196019 01-Aug-2009 rwatson

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 195727 16-Jul-2009 rwatson

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# 195699 14-Jul-2009 rwatson

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 192761 25-May-2009 rwatson

Remove comment about moving tcp_reass() to its own file named tcp_reass.c,
that happened a while ago.

MFC after: 3 days


# 190948 11-Apr-2009 rwatson

Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after: 3 days


# 190787 06-Apr-2009 zec

First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

At this stage, the new per-vnet initializer functions are
directly called from the existing global initialization code,
which should in most cases result in compiler inlining those
new functions, hence yielding a near-zero functional change.

Modify the existing initializer functions which are invoked via
protosw, like ip_init() et. al., to allow them to be invoked
multiple times, i.e. per each vnet. Global state, if any,
is initialized only if such functions are called within the
context of vnet0, which will be determined via the
IS_DEFAULT_VNET(curvnet) check (currently always true).

While here, V_irtualize a few remaining global UMA zones
used by net/netinet/netipsec networking code. While it is
not yet clear to me or anybody else whether this is the right
thing to do, at this stage this makes the code more readable,
and makes it easier to track uncollected UMA-zone-backed
objects on vnet removal. In the long run, it's quite possible
that some form of shared use of UMA zone pools among multiple
vnets should be considered.

Bump __FreeBSD_version due to changes in layout of structs
vnet_ipfw, vnet_inet and vnet_net.

Approved by: julian (mentor)


# 185571 02-Dec-2008 bz

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 185088 19-Nov-2008 zec

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 183550 02-Oct-2008 zec

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 181803 17-Aug-2008 bz

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 178285 17-Apr-2008 rwatson

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# 172467 07-Oct-2007 silby

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# 169541 13-May-2007 andre

Complete the (mechanical) move of the TCP reassembly and timewait
functions from their origininal place to their own files.

TCP Reassembly from tcp_input.c -> tcp_reass.c
TCP Timewait from tcp_subr.c -> tcp_timewait.c


# 169481 11-May-2007 andre

Drop everything that doesn't belong into this new file.
It's neither functional nor connected to the build yet.


# 169454 10-May-2007 rwatson

Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.


# 169417 09-May-2007 maxim

o Fix style(9) bugs introduced in the last commit.

Pointed out by: bde


# 169405 09-May-2007 maxim

o Unbreak "options TCPDEBUG" && "nooptions INET6" kernel build.

PR: kern/112517
Submitted by: vd


# 169317 06-May-2007 andre

Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool. Change all users accordingly.


# 169316 06-May-2007 andre

o Remove redundant tcp reassembly check in header prediction code
o Rearrange code to make intent in TCPS_SYN_SENT case more clear
o Assorted style cleanup
o Comment clarification for tcp_dropwithreset()


# 169315 06-May-2007 andre

Reorder the TCP header prediction test to check for the most volatile
values first to spend less time on a fallback to normal processing.


# 169314 06-May-2007 andre

Remove the defunct remains of the TCPS_TIME_WAIT cases from tcp_do_segment
and change it to a void function.

We use a compressed structure for TCPS_TIME_WAIT to save memory. Any late
late segments arriving for such a connection is handled directly in the TW
code.


# 169268 04-May-2007 rwatson

Tweak comment at end of tcp_input() when calling into tcp_do_segment(): the
pcbinfo lock will be released as well, not just the pcb lock.


# 168986 23-Apr-2007 andre

o Fix INP lock leak in the minttl case
o Remove indirection in the decision of unlocking inp
o Further annotation of locking in tcp_input()


# 168906 20-Apr-2007 andre

o Remove unncessary TOF_SIGLEN flag from struct tcpopt
o Correctly set to->to_signature in tcp_dooptions()
o Update comments


# 168905 20-Apr-2007 andre

Add more KASSERT's.


# 168903 20-Apr-2007 andre

Remove bogus check for accept queue length and associated failure handling
from the incoming SYN handling section of tcp_input().

Enforcement of the accept queue limits is done by sonewconn() after the
3WHS is completed. It is not necessary to have an earlier check before a
connection request enters the SYN cache awaiting the full handshake. It
rather limits the effectiveness of the syncache by preventing legit and
illegit connections from entering it and having them shaken out before we
hit the real limit which may have vanished by then.

Change return value of syncache_add() to void. No status communication
is required.


# 168902 20-Apr-2007 andre

Simplifly syncache_expand() and clarify its semantics. Zero is returned
when the ACK is invalid and doesn't belong to any registered connection,
either in syncache or through SYN cookies. True but a NULL struct socket
is returned when the 3WHS completed but the socket could not be created
due to insufficient resources or limits reached.

For both cases an RST is sent back in tcp_input().

A logic error leading to a panic is fixed where syncache_expand() would
free the mbuf on socket allocation failure but tcp_input() later supplies
it to tcp_dropwithreset() to issue a RST to the peer.

Reported by: kris (the panic)


# 168769 15-Apr-2007 rwatson

Remove unused variable tcbinfo_mtx.


# 168615 11-Apr-2007 andre

Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by: rwatson (earlier version)


# 168369 04-Apr-2007 andre

Add INP_INFO_UNLOCK_ASSERT() and use it in tcp_input(). Also add some
further INP_INFO_WLOCK_ASSERT() while there.


# 168368 04-Apr-2007 andre

Move last tcpcb initialization for the inbound connection case from
tcp_input() to syncache_socket() where it belongs and the majority
of it already happens.

The "tp->snd_up = tp->snd_una" is removed as it is done with the
tcp_sendseqinit() macro a few lines earlier.


# 168364 04-Apr-2007 andre

Retire unused TCP_SACK_DEBUG.


# 168363 04-Apr-2007 andre

In tcp_dooptions() skip over SACK options if it is a SYN segment.


# 167989 28-Mar-2007 andre

When blackholing do a 'dropunlock' in the new world order to prevent the
INP_INFO_LOCK from leaking.

Reported by: ache
Found by: rwatson


# 167873 24-Mar-2007 maxim

o Use a define for a buffer size.

Prodded by: db

o Add missed vars for TCPDEBUG in tcp_do_segment().

Prodded by: tinderbox


# 167839 23-Mar-2007 andre

Split tcp_input() into its two functional parts:

o tcp_input() now handles TCP segment sanity checks and preparations
including the INPCB lookup and syncache.
o tcp_do_segment() handles all data and ACK processing and is IPv4/v6
agnostic.

Change all KASSERT() messages to ("%s: ", __func__).

The changes in this commit are primarily of mechanical nature and no
functional changes besides the function split are made.

Discussed with: rwatson


# 167834 23-Mar-2007 andre

Tidy up some code to conform better to surroundings and style(9), 0 = NULL
and space/tab.


# 167833 23-Mar-2007 andre

Bring SACK option handling in tcp_dooptions() in line with all other
options and ajust users accordingly.


# 167785 21-Mar-2007 andre

ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.


# 167779 21-Mar-2007 andre

Tidy up IPFIREWALL_FORWARD sections and comments.


# 167778 21-Mar-2007 andre

Update and clarify comments in first section of tcp_input().


# 167777 21-Mar-2007 andre

Tidy up the ACCEPTCONN section of tcp_input(), ajust comments and remove
old dead T/TCP code.


# 167775 21-Mar-2007 andre

Tidy up tcp_log_in_vain and blackhole.


# 167774 21-Mar-2007 andre

Make TCP_DROP_SYNFIN a standard part of TCP. Disabled by default it
doesn't impede normal operation negatively and is only a few lines of
code. It's close relatives blackhole and log_in_vain aren't options
either.


# 167772 21-Mar-2007 andre

Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code. Should
the problem surface in the future we can simply resurrect it from cvs
history.


# 167721 19-Mar-2007 andre

Match up SYSCTL declaration style.


# 167606 15-Mar-2007 andre

Consolidate insertion of TCP options into a segment from within tcp_output()
and syncache_respond() into its own generic function tcp_addoptions().

tcp_addoptions() is alignment agnostic and does optimal packing in all cases.

In struct tcpopt rename to_requested_s_scale to just to_wscale.

Add a comment with quote from RFC1323: "The Window field in a SYN (i.e.,
a <SYN> or <SYN,ACK>) segment itself is never scaled."

Reviewed by: silby, mohans, julian
Sponsored by: TCP/IP Optimization Fundraise 2005


# 167310 07-Mar-2007 qingli

This patch is provided to fix a couple of deployment issues observed
in the field. In one situation, one end of the TCP connection sends
a back-to-back RST packet, with delayed ack, the last_ack_sent variable
has not been update yet. When tcp_insecure_rst is turned off, the code
treats the RST as invalid because last_ack_sent instead of rcv_nxt is
compared against th_seq. Apparently there is some kind of firewall that
sits in between the two ends and that RST packet is the only RST
packet received. With short lived HTTP connections, the symptom is
a large accumulation of connections over a short period of time .

The +/-(1) factor is to take care of implementations out there that
generate RST packets with these types of sequence numbers. This
behavior has also been observed in live environments.

Reviewed by: silby, Mike Karels
MFC after: 1 week


# 167120 28-Feb-2007 mohans

In the SYN_SENT case, Initialize the snd_wnd before the call to tcp_mss().
The TCP hostcache logic in tcp_mss() depends on the snd_wnd being initialized.


# 167036 26-Feb-2007 mohans

Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.


# 166842 20-Feb-2007 rwatson

Rename two identically named log_in_vain variables: tcp_input.c's static
log_in_vain to tcp_log_in_vain, and udp_usrreq's global log_in_vain to
udp_log_in_vain.

MFC after: 1 week


# 166405 01-Feb-2007 andre

Auto sizing TCP socket buffers.

Normally the socket buffers are static (either derived from global
defaults or set with setsockopt) and do not adapt to real network
conditions. Two things happen: a) your socket buffers are too small
and you can't reach the full potential of the network between both
hosts; b) your socket buffers are too big and you waste a lot of
kernel memory for data just sitting around.

With automatic TCP send and receive socket buffers we can start with a
small buffer and quickly grow it in parallel with the TCP congestion
window to match real network conditions.

FreeBSD has a default 32K send socket buffer. This supports a maximal
transfer rate of only slightly more than 2Mbit/s on a 100ms RTT
trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send
buffer auto scaling and the default values below it supports 20Mbit/s
at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or
1000%. For the receive side it looks slightly better with a default of
64K buffer size.

New sysctls are:
net.inet.tcp.sendbuf_auto=1 (enabled)
net.inet.tcp.sendbuf_inc=8192 (8K, step size)
net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
net.inet.tcp.recvbuf_auto=1 (enabled)
net.inet.tcp.recvbuf_inc=16384 (16K, step size)
net.inet.tcp.recvbuf_max=262144 (256K, growth limit)

Tested by: many (on HEAD and RELENG_6)
Approved by: re
MFC after: 1 month


# 165118 12-Dec-2006 bz

MFp4: 92972, 98913 + one more change

In ip6_sprintf no longer use and return one of eight static buffers
for printing/logging ipv6 addresses.
The caller now has to hand in a sufficiently large buffer as first
argument.


# 163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 162642 26-Sep-2006 jmg

fix calculating to_tsecr... This prevents the rtt calculations from
going all wonky...


# 162580 23-Sep-2006 bms

Always set the IP version in the TCP input path, to preserve
the header field for possible later IPSEC SPD lookup, even
when the kernel is built without 'options INET6'.

PR: kern/57760
MFC after: 1 week
Submitted by: Joachim Schueth


# 162277 13-Sep-2006 andre

Rewrite of TCP syncookies to remove locking requirements and to enhance
functionality:

- Remove a rwlock aquisition/release per generated syncookie. Locking
is now integrated with the bucket row locking of syncache itself and
syncookies no longer add any additional lock overhead.
- Syncookie secrets are different for and stored per syncache buck row.
Secrets expire after 16 seconds and are reseeded on-demand.
- The computational overhead for syncookie generation and verification
is one MD5 hash computation as before.
- Syncache can be turned off and run with syncookies only by setting the
sysctl net.inet.tcp.syncookies_only=1.

This implementation extends the orginal idea and first implementation
of FreeBSD by using not only the initial sequence number field to store
information but also the timestamp field if present. This way we can
keep track of the entire state we need to know to recreate the session in
its original form. Almost all TCP speakers implement RFC1323 timestamps
these days. For those that do not we still have to live with the known
shortcomings of the ISN only SYN cookies. The use of the timestamp field
causes the timestamps to be randomized if syncookies are enabled.

The idea of SYN cookies is to encode and include all necessary information
about the connection setup state within the SYN-ACK we send back and thus
to get along without keeping any local state until the ACK to the SYN-ACK
arrives (if ever). Everything we need to know should be available from
the information we encoded in the SYN-ACK.

A detailed description of the inner working of the syncookies mechanism
is included in the comments in tcp_syncache.c.

Reviewed by: silby (slightly earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005


# 162111 07-Sep-2006 ru

Back when we had T/TCP support, we used to apply different
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).

Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).


# 162084 06-Sep-2006 andre

First step of TSO (TCP segmentation offload) support in our network stack.

o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005


# 161226 11-Aug-2006 mohans

Fixes an edge case bug in timewait handling where ticks rolling over causing
the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry).
Reviewed by: silby


# 160024 29-Jun-2006 bz

Use INPLOOKUP_WILDCARD instead of just 1 more consistently.

OKed by: rwatson (some weeks ago)


# 159950 26-Jun-2006 andre

Some cleanups and janitorial work to tcp_syncache:

o don't assign remote/local host/port information manually between provided
struct in_conninfo and struct syncache, bcopy() it instead
o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture
the purpose of this field
o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons
o fix IPSEC error case printf's to report correct function name
o in syncache_socket() only transpose enhanced tcp options parameters to
struct tcpcb when the inpcb doesn't has TF_NOOPT set
o in syncache_respond() reorder stack variables
o in syncache_respond() remove bogus KASSERT()

No functional changes.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 159949 26-Jun-2006 andre

Some cleanups and janitorial work to tcp_dooptions():

o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its
usage accordingly
o update the comments to the tcp_dooptions() invocation in
tcp_input():after_listen to reflect reality
o move the logic checking the echoed timestamp out of tcp_dooptions() to the
only place that uses it next to the invocation described in the previous
item
o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others
o add comments in to struct tcpopt.to_flags #defines

No functional changes.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 159772 19-Jun-2006 dwmalone

When we receive an out-of-window SYN for an "ESTABLISHED" connection,
ACK the SYN as required by RFC793, rather than ignoring it. NetBSD
have had a similar change since 1999.

PR: 93236
Submitted by: Grant Edwards <grante@visi.com>
MFC after: 1 month


# 159695 17-Jun-2006 andre

Add locking to TCP syncache and drop the global tcpinfo lock as early
as possible for the syncache_add() case. The syncache timer no longer
aquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.

On a P4 the additional locks cause a slight degression of 0.7% in tcp
connections per second. When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it. However this would be a more
involved change from the way syncookies work at the moment.

Reviewed by: rwatson
Tested by: rwatson, ps (earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005


# 157927 21-Apr-2006 ps

Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.


# 157609 09-Apr-2006 rwatson

Modify tcp_timewait() to accept an inpcb reference, not a tcptw
reference. For now, we allow the possibility that the in_ppcb
pointer in the inpcb may be NULL if a timewait socket has had its
tcptw structure recycled. This allows tcp_timewait() to
consistently unlock the inpcb.

Reported by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


# 157534 05-Apr-2006 rwatson

Don't unlock a timewait structure if the pointer is NULL in
tcp_timewait(). This corrects a bug (or lack of fixing of a bug)
in tcp_input.c:1.295.

Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


# 157474 04-Apr-2006 rwatson

Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being
NULL. We currently do allow this to happen, but may want to remove that
possibility in the future. This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled. This will
be cleaned up in the future.

Found by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


# 157432 03-Apr-2006 rwatson

Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed. Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.

MFC after: 3 months


# 157376 01-Apr-2006 rwatson

Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.

- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.

- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after: 3 months


# 157136 26-Mar-2006 rwatson

Explicitly assert socket pointer is non-NULL in tcp_input() so as to
provide better debugging information.

Prefer explicit comparison to NULL for tcpcb pointers rather than
treating them as booleans.

MFC after: 1 month


# 156125 28-Feb-2006 andre

Rework TCP window scaling (RFC1323) to properly scale the send window
right from the beginning and partly clean up the differences in handling
between SYN_SENT and SYN_RCVD (syncache).

Further changes to this code to come. This is a first incremental step
to a general overhaul and streamlining of the TCP code.

PR: kern/15095
PR: kern/92690 (partly)
Reviewed by: qingli (and tested with ANVL)
Sponsored by: TCP/IP Optimization Fundraise 2005


# 155961 23-Feb-2006 qingli

This patch fixes the problem where the current TCP code can not handle
simultaneous open. Both the bug and the patch were verified using the
ANVL test suite.

PR: kern/74935
Submitted by: qingli (before I became committer)
Reviewed by: andre
MFC after: 5 days


# 155819 18-Feb-2006 andre

Remove unneeded includes and provide more accurate description
to others.

Submitted by: garys
PR: kern/86437


# 155767 16-Feb-2006 andre

Have TCP Inflight disable itself if the RTT is below a certain
threshold. Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.

The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage. It defaults
to 10ms.

Tested by: Joao Barros <joao.barros-at-gmail.com>,
Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by: TCP/IP Optimization Fundraise 2005


# 154528 18-Jan-2006 andre

Do not derefence the ip header pointer in the IPv6 case.
This fixes a bug in the previous commit.

Found by: Coverity Prevent(tm)
Coverity ID: CID253
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 154366 14-Jan-2006 gnn

Check the correct TTL in both the IPv6 and IPv4 cases.

Submitted by: glebius
Reviewed by: gnn, bz
Found with: Coverity Prevent(tm)


# 152592 18-Nov-2005 andre

Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
ip_dooptions(struct mbuf *m, int pass)
save_rte(m, option, dst)
ip_srcroute(m0)
ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
ip_insertoptions(m, opt, phlen)
ip_optcopy(ip, jp)
ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


# 151464 19-Oct-2005 rwatson

Convert if (tp->t_state == TCPS_LISTEN) panic() into a KASSERT.

MFC after: 2 weeks


# 149404 24-Aug-2005 ps

Remove a KASSERT in the sack path that fails because of a interaction
between sack and a bug in the "bad retransmit recovery" logic. This is
a workaround, the underlying bug will be fixed later.

Submitted by: Mohan Srinivasan, Noritoshi Demizu


# 149371 22-Aug-2005 andre

Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket. All packets with a
lower TTL are silently dropped. Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682. Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005


# 147781 05-Jul-2005 ps

Fix for a bug in newreno partial ack handling where if a large amount
of data is partial acked, snd_cwnd underflows, causing a burst.

Found, Submitted by: Noritoshi Demizu
Approved by: re


# 147735 01-Jul-2005 ps

Fix for a bug in the change that defers sack option processing until
after PAWS checks. The symptom of this is an inconsistency in the cached
sack state, caused by the fact that the sack scoreboard was not being
updated for an ACK handled in the header prediction path.

Found by: Andrey Chernov.
Submitted by: Noritoshi Demizu, Raja Mukerji.
Approved by: re


# 147734 01-Jul-2005 ps

Fix for a SACK crash caused by a bug in tcp_reass(). tcp_reass()
does not clear tlen and frees the mbuf (leaving th pointing at
freed memory), if the data segment is a complete duplicate.
This change works around that bug. A fix for the tcp_reass() bug
will appear later (that bug is benign for now, as neither th nor
tlen is referenced in tcp_input() after the call to tcp_reass()).

Found by: Pawel Jakub Dawidek.
Submitted by: Raja Mukerji, Noritoshi Demizu.
Approved by: re


# 147666 29-Jun-2005 simon

Fix ipfw packet matching errors with address tables.

The ipfw tables lookup code caches the result of the last query. The
kernel may process multiple packets concurrently, performing several
concurrent table lookups. Due to an insufficient locking, a cached
result can become corrupted that could cause some addresses to be
incorrectly matched against a lookup table.

Submitted by: ru
Reviewed by: csjp, mlaier
Security: CAN-2005-2019
Security: FreeBSD-SA-05:13.ipfw

Correct bzip2 permission race condition vulnerability.

Obtained from: Steve Grubb via RedHat
Security: CAN-2005-0953
Security: FreeBSD-SA-05:14.bzip2
Approved by: obrien

Correct TCP connection stall denial of service vulnerability.

A TCP packets with the SYN flag set is accepted for established
connections, allowing an attacker to overwrite certain TCP options.

Submitted by: Noritoshi Demizu
Reviewed by: andre, Mohan Srinivasan
Security: CAN-2005-2068
Security: FreeBSD-SA-05:15.tcp

Approved by: re (security blanket), cperciva


# 147637 27-Jun-2005 ps

- Postpone SACK option processing until after PAWS checks. SACK option
processing is now done in the ACK processing case.
- Merge tcp_sack_option() and tcp_del_sackholes() into a new function
called tcp_sack_doack().
- Test (SEG.ACK < SND.MAX) before processing the ACK.

Submitted by: Noritoshi Demizu
Reveiewed by: Mohan Srinivasan, Raja Mukerji
Approved by: re


# 147605 25-Jun-2005 ups

Fix a timer ticks wrap around bug for minmssoverload processing.

Approved by: re (scottl,dwhite)
MFC after: 4 weeks


# 146863 01-Jun-2005 rwatson

Assert that tcbinfo is locked in tcp_input() before calling into
tcp_drop().

MFC after: 7 days


# 146862 01-Jun-2005 rwatson

Assert the tcbinfo lock whenever tcp_close() is to be called by
tcp_input().

MFC after: 7 days


# 146630 25-May-2005 ps

This is conform with the terminology in

M.Mathis and J.Mahdavi,
"Forward Acknowledgement: Refining TCP Congestion Control"
SIGCOMM'96, August 1996.

Submitted by: Noritoshi Demizu, Raja Mukerji


# 146123 11-May-2005 ps

When looking for the next hole to retransmit from the scoreboard,
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.

This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.

The debug code that sanity checks the hints against the computed
value will be removed eventually.

Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.


# 145087 14-Apr-2005 ps

Fix for a TCP SACK bug where more than (win/2) bytes could have been
in flight in SACK recovery.

Found by: Noritoshi Demizu
Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com>
Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>
Raja Mukerji <raja at moselle dot com>


# 144858 10-Apr-2005 ps

- Tighten up the Timestamp checks to prevent a spoofed segment from
setting ts_recent to an arbitrary value, stopping further
communication between the two hosts.
- If the Echoed Timestamp is greater than the current time,
fall back to the non RFC 1323 RTT calculation.

Submitted by: Raja Mukerji (raja at moselle dot com)
Reviewed by: Noritoshi Demizu, Mohan Srinivasan


# 144857 10-Apr-2005 ps

- If the reassembly queue limit was reached or if we couldn't allocate
a reassembly queue state structure, don't update (receiver) sack
report.
- Similarly, if tcp_drain() is called, freeing up all items on the
reassembly queue, clean the sack report.

Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).


# 142031 17-Feb-2005 ps

Remove 2 (SACK) fields from the tcpcb. These are only used by a
function that is called from tcp_input(), so they oughta be passed on
the stack instead of stuck in the tcpcb.

Submitted by: Mohan Srinivasan


# 141961 16-Feb-2005 ps

Fix for a SACK (receiver) bug where incorrect SACK blocks are
reported to the sender - in the case where the sender sends data
outside the window (as WinXP does :().

Reported by: Sam Jensen <sam at wand dot net dot nz>
Submitted by: Mohan Srinivasan


# 141928 14-Feb-2005 ps

- Retransmit just one segment on initiation of SACK recovery.
Remove the SACK "initburst" sysctl.
- Fix bugs in SACK dupack and partialack handling that can cause
large bursts while in SACK recovery.

Submitted by: Mohan Srinivasan


# 139823 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 139606 03-Jan-2005 silby

Add a sysctl (net.inet.tcp.insecure_rst) which allows one to specify
that the RFC 793 specification for accepting RST packets should be
following. When followed, this makes one vulnerable to the attacks
described in "slipping in the window", but it may be necessary in
some odd circumstances.


# 139298 25-Dec-2004 rwatson

In the dropafterack case of tcp_input(), it's OK to release the TCP
pcbinfo lock before calling tcp_output(), as holding just the inpcb
lock is sufficient to prevent garbage collection.


# 139297 25-Dec-2004 rwatson

Revert parts of tcp_input.c:1.255 associated with the header predicted
cases for tcp_input():

While it is true that the pcbinfo lock provides a pseudo-reference to
inpcbs, both the inpcb and pcbinfo locks are required to free an
un-referenced inpcb. As such, we can release the pcbinfo lock as
long as the inpcb remains locked with the confidence that it will not
be garbage-collected. This leads to a less conservative locking
strategy that should reduce contention on the TCP pcbinfo lock.

Discussed with: sam


# 138148 28-Nov-2004 rwatson

Assert the inpcb lock in tcp_xmit_timer() as it performs read-modify-
write of various time/rtt-related fields in the tcpcb.


# 138147 28-Nov-2004 rwatson

Expand coverage of the receive socket buffer lock when handling urgent
pointer updates: test available space while holding the socket buffer
mutex, and continue to hold until until the pointer update has been
performed.

MFC after: 2 weeks


# 138098 25-Nov-2004 silby

Fix a problem where our TCP stack would ignore RST packets if the receive
window was 0 bytes in size. This may have been the cause of unsolved
"connection not closing" reports over the years.

Thanks to Michiel Boland for providing the fix and providing a concise
test program for the problem.

Submitted by: Michiel Boland
MFC after: 2 weeks


# 138040 23-Nov-2004 rwatson

In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the
contents of the tcpcb are read and modified in volume.

In tcp_input(), replace th comparison with 0 with a comparison with
NULL.

At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in
tcp_input(), assert 'headlocked'. Try to improve consistency between
various assertions regarding headlocked to be more informative.

MFC after: 2 weeks


# 138025 23-Nov-2004 rwatson

tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it. Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after: 2 weeks


# 137988 22-Nov-2004 rwatson

Remove "Unlocked read" annotations associated with previously unlocked
use of socket buffer fields in the TCP input code. These references
are now protected by use of the receive socket buffer lock.

MFC after: 1 week


# 137349 07-Nov-2004 rwatson

Do some re-sorting of TCP pcbinfo locking and assertions: make sure to
retain the pcbinfo lock until we're done using a pcb in the in-bound
path, as the pcbinfo lock acts as a pseuo-reference to prevent the pcb
from potentially being recycled. Clean up assertions and make sure to
assert that the pcbinfo is locked at the head of code subsections where
it is needed. Free the mbuf at the end of tcp_input after releasing
any held locks to reduce the time the locks are held.

MFC after: 3 weeks


# 137139 02-Nov-2004 andre

Remove RFC1644 T/TCP support from the TCP side of the network stack.

A complete rationale and discussion is given in this message
and the resulting discussion:

http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on: -arch


# 136151 05-Oct-2004 ps

- Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.

Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>


# 133920 17-Aug-2004 andre

Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland
and preserves the ipfw ABI. The ipfw core packet inspection and filtering
functions have not been changed, only how ipfw is invoked is different.

However there are many changes how ipfw is and its add-on's are handled:

In general ipfw is now called through the PFIL_HOOKS and most associated
magic, that was in ip_input() or ip_output() previously, is now done in
ipfw_check_[in|out]() in the ipfw PFIL handler.

IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to
be diverted is checked if it is fragmented, if yes, ip_reass() gets in for
reassembly. If not, or all fragments arrived and the packet is complete,
divert_packet is called directly. For 'tee' no reassembly attempt is made
and a copy of the packet is sent to the divert socket unmodified. The
original packet continues its way through ip_input/output().

ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet
with the new destination sockaddr_in. A check if the new destination is a
local IP address is made and the m_flags are set appropriately. ip_input()
and ip_output() have some more work to do here. For ip_input() the m_flags
are checked and a packet for us is directly sent to the 'ours' section for
further processing. Destination changes on the input path are only tagged
and the 'srcrt' flag to ip_forward() is set to disable destination checks
and ICMP replies at this stage. The tag is going to be handled on output.
ip_output() again checks for m_flags and the 'ours' tag. If found, the
packet will be dropped back to the IP netisr where it is going to be picked
up by ip_input() again and the directly sent to the 'ours' section. When
only the destination changes, the route's 'dst' is overwritten with the
new destination from the forward m_tag. Then it jumps back at the route
lookup again and skips the firewall check because it has been marked with
M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with
'option IPFIREWALL_FORWARD' to enable it.

DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for
a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will
then inject it back into ip_input/ip_output() after it has served its time.
Dummynet packets are tagged and will continue from the next rule when they
hit the ipfw PFIL handlers again after re-injection.

BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as
they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS.

More detailed changes to the code:

conf/files
Add netinet/ip_fw_pfil.c.

conf/options
Add IPFIREWALL_FORWARD option.

modules/ipfw/Makefile
Add ip_fw_pfil.c.

net/bridge.c
Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw
is still directly invoked to handle layer2 headers and packets would
get a double ipfw when run through PFIL_HOOKS as well.

netinet/ip_divert.c
Removed divert_clone() function. It is no longer used.

netinet/ip_dummynet.[ch]
Neither the route 'ro' nor the destination 'dst' need to be stored
while in dummynet transit. Structure members and associated macros
are removed.

netinet/ip_fastfwd.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.

netinet/ip_fw.h
Removed 'ro' and 'dst' from struct ip_fw_args.

netinet/ip_fw2.c
(Re)moved some global variables and the module handling.

netinet/ip_fw_pfil.c
New file containing the ipfw PFIL handlers and module initialization.

netinet/ip_input.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code. ip_forward() does not longer require
the 'next_hop' struct sockaddr_in argument. Disable early checks
if 'srcrt' is set.

netinet/ip_output.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.

netinet/ip_var.h
Add ip_reass() as general function. (Used from ipfw PFIL handlers
for IPDIVERT.)

netinet/raw_ip.c
Directly check if ipfw and dummynet control pointers are active.

netinet/tcp_input.c
Rework the 'ipfw forward' to local code to work with the new way of
forward tags.

netinet/tcp_sack.c
Remove include 'opt_ipfw.h' which is not needed here.

sys/mbuf.h
Remove m_claim_next() macro which was exclusively for ipfw 'forward'
and is no longer needed.

Approved by: re (scottl)


# 133874 16-Aug-2004 rwatson

White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>


# 132044 12-Jul-2004 rwatson

After each label in tcp_input(), assert the inpcbinfo and inpcb lock
state that we expect.


# 131427 01-Jul-2004 jayanth

On receiving 3 duplicate acknowledgements, SACK recovery was not being entered correctly.
Fix this problem by separating out the SACK and the newreno cases. Also, check
if we are in FASTRECOVERY for the sack case and if so, turn off dupacks.

Fix an issue where the congestion window was not being incremented by ssthresh.

Thanks to Mohan Srinivasan for finding this problem.


# 131151 26-Jun-2004 rwatson

Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
acquire the socket buffer lock and use the _locked() variants of both
calls. Note that the _locked() sowakeup() versions unlock the mutex on
return. This is done in uipc_send(), divert_packet(), mroute
socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
explicit unlock and use the _locked() sowakeup() variant. This is done
in soisdisconnecting(), soisdisconnected() when setting the can't send/
receive flags and dropping data, and in uipc_rcvd() which adjusting
back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.


# 131079 25-Jun-2004 ps

White space & spelling fixes

Submitted by: Xin LI <delphij@frontfree.net>


# 131018 24-Jun-2004 rwatson

Broaden scope of the socket buffer lock when processing an ACK so that
the read and write of sb_cc are atomic. Call sbdrop_locked() instead
of sbdrop() since we already hold the socket buffer lock.


# 131017 24-Jun-2004 rwatson

Protect so_oobmark with with SOCKBUF_LOCK(&so->so_rcv), and broaden
locking in tcp_input() for TCP packets with urgent data pointers to
hold the socket buffer lock across testing and updating oobmark
from just protecting sb_state.

Update socket locking annotations


# 131006 24-Jun-2004 rwatson

Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted. sbreserve() now acquires
the lock before calling sbreserve_locked(). In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order. In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.


# 130989 23-Jun-2004 ps

Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)


# 130811 20-Jun-2004 rwatson

Assert the inpcb lock before letting MAC check whether we can deliver
to the inpcb in tcp_input().


# 130584 16-Jun-2004 bms

Fix build for IPSEC && !INET6

PR: kern/66125
Submitted by: Cyrille Lefevre


# 130513 15-Jun-2004 rwatson

Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field. This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.


# 130480 14-Jun-2004 rwatson

The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.


# 130398 13-Jun-2004 rwatson

Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.


# 128829 02-May-2004 darrenr

Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier.


# 128828 02-May-2004 darrenr

oops, I forgot this file in a prior commit (change was still sitting here,
uncommitted):

Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h
alongside all of the other macros that work ok mbuf's and tag's.


# 128653 26-Apr-2004 silby

Tighten up reset handling in order to make reset attacks as difficult as
possible while maintaining compatibility with the widest range of TCP stacks.

The algorithm is as follows:

---
For connections in the ESTABLISHED state, only resets with
sequence numbers exactly matching last_ack_sent will cause a reset,
all other segments will be silently dropped.

For connections in all other states, a reset anywhere in the window
will cause the connection to be reset. All other segments will be
silently dropped.
---

The necessity of accepting all in-window resets was discovered
by jayanth and jlemon, both of whom have seen TCP stacks that
will respond to FIN-ACK packets with resets not meeting the
strict last_ack_sent check.

Idea by: Darren Reed
Reviewed by: truckman, jlemon, others(?)


# 128592 23-Apr-2004 andre

Correct an edge case in tcp_mss() where the cached path MTU
from tcp_hostcache would have overridden a (now) lower MTU of
an interface or route that changed since first PMTU discovery.
The bug would have caused TCP to redo the PMTU discovery when
not strictly necessary.

Make a comment about already pre-initialized default values
more clear.

Reviewed by: sam


# 128019 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 126456 01-Mar-2004 ume

fix -O0 compilation without INET6.

Pointed out by: ru


# 126351 28-Feb-2004 rwatson

Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These
were needed by the MAC Framework until inpcbs gained labels.

Submitted by: sam


# 126239 25-Feb-2004 mlaier

Re-remove MT_TAGs. The problems with dummynet have been fixed now.

Tested by: -current, bms(mentor), me
Approved by: bms(mentor), sam


# 126220 25-Feb-2004 hsu

Relax a KASSERT condition to allow for a valid corner case where
the FIN on the last segment consumes an extra sequence number.

Spurious panic reported by Mike Silbersack <silby@silby.com>.


# 126193 24-Feb-2004 andre

Convert the tcp segment reassembly queue to UMA and limit the maximum
amount of segments it will hold.

The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:

net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).

net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).

net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.

net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.

Tested by: bms, silby
Reviewed by: bms, silby


# 125952 18-Feb-2004 mlaier

Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is
not working properly with the patch in place.

Approved by: bms(mentor)


# 125941 17-Feb-2004 ume

IPSEC and FAST_IPSEC have the same internal API now;
so merge these (IPSEC has an extra ipsecstat)

Submitted by: "Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>


# 125784 13-Feb-2004 mlaier

This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).

This is (mostly) work from: sam

Silence from: -arch
Approved by: bms(mentor), sam, rwatson


# 125783 13-Feb-2004 bms

Brucification.

Submitted by: bde


# 125740 12-Feb-2004 bms

Remove an unnecessary initialization that crept in from the code which
verifies TCP-MD5 digests.

Noticed by: njl


# 125680 11-Feb-2004 bms

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


# 125396 03-Feb-2004 ume

pass pcb rather than so. it is expected that per socket policy
works again.


# 124761 20-Jan-2004 hsu

Merge from DragonFlyBSD rev 1.10:

date: 2003/09/02 10:04:47; author: hsu; state: Exp; lines: +5 -6
Account for when Limited Transmit is not congestion window limited.

Obtained from: DragonFlyBSD


# 124258 08-Jan-2004 andre

Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.

For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limites of network components along the path.

This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).

We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.

o the local host is reveiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may chose to do what is described in the first attack and
send the data in packets with an TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.

For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).

This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.

We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is resetted and dropped.

MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day


# 124199 06-Jan-2004 andre

Enable the following TCP options by default to give it more exposure:

rfc3042 Limited retransmit
rfc3390 Increasing TCP's initial congestion Window
inflight TCP inflight bandwidth limiting

All my production server have it enabled and there have been no
issues. I am confident about having them on by default and it gives
us better overall TCP performance.

Reviewed by: sam (mentor)


# 122987 25-Nov-2003 andre

Restructure a too broad ifdef which was disabling the setting of the
tcp flightsize sysctl value for local networks in the !INET6 case.

Approved by: re (scottl)


# 122922 20-Nov-2003 andre

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 122875 18-Nov-2003 rwatson

Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 122576 12-Nov-2003 andre

dropwithreset is not needed in this case as tcp_drop() is already notifying
the other side. Before we were sending two RST packets.


# 122327 08-Nov-2003 sam

o correct locking problem: the inpcb must be held across tcp_respond
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond

Supported by: FreeBSD Foundation


# 121628 28-Oct-2003 sam

speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by: ps/jayanth
Obtained from: netbsd (jason thorpe)


# 121285 20-Oct-2003 ume

enclose IPv6 part with ifdef INET6.

Obtained from: KAME


# 121283 20-Oct-2003 ume

correct linkmtu handling.

Obtained from: KAME


# 121161 17-Oct-2003 ume

- add dom_if{attach,detach} framework.
- transition to use ifp->if_afdata.

Obtained from: KAME


# 118861 13-Aug-2003 harti

A number of patches in the last years have created new return paths
in tcp_input that leave the function before hitting the tcp_trace
function call for the TCPDEBUG option. This has made TCPDEBUG mostly
useless (and tools like ports/benchmarks/dbs not working). Add
tcp_trace calls to the return paths that could be identified in this
maze.

This is a NOP unless you compile with TCPDEBUG.


# 117650 15-Jul-2003 hsu

Unify the "send high" and "recover" variables as specified in the
lastest rev of the spec. Use an explicit flag for Fast Recovery. [1]

Fix bug with exiting Fast Recovery on a retransmit timeout
diagnosed by Lu Guohan. [2]

Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com>
Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2]
Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>,
Sally Floyd <floyd@acm.org> [1]


# 115503 31-May-2003 phk

Add /* FALLTHROUGH */

Found by: FlexeLint


# 114794 07-May-2003 rwatson

Correct a bug introduced with reduced TCP state handling; make
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.

Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.

Approved by: re (scottl)
Reviewed by: hsu
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 113799 21-Apr-2003 obrien

Explicitly declare 'int' parameters.


# 112957 01-Apr-2003 hsu

Observe conservation of packets when entering Fast Recovery while
doing Limited Transmit. Only artificially inflate the congestion
window by 1 segment instead of the usual 3 to take into account
the 2 already sent by Limited Transmit.

Approved in principle by: Mark Allman <mallman@grc.nasa.gov>,
Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>


# 112191 13-Mar-2003 hsu

Greatly simplify the unlocking logic by holding the TCP protocol lock until
after FIN_WAIT_2 processing.

Helped with debugging: Doug Barton


# 112171 13-Mar-2003 hsu

Add support for RFC 3390, which allows for a variable-sized
initial congestion window.


# 112162 12-Mar-2003 hsu

Implement the Limited Transmit algorithm (RFC 3042).


# 112009 08-Mar-2003 jlemon

Remove a panic(); if the zone allocator can't provide more timewait
structures, reuse the oldest one. Also move the expiry timer from
a per-structure callout to the tcp slow timer.

Sponsored by: DARPA, NAI Labs


# 111560 26-Feb-2003 jlemon

In timewait state, if the incoming segment is a pure in-sequence ack
that matches snd_max, then do not respond with an ack, just drop the
segment. This fixes a problem where a simultaneous close results in
an ack loop between two time-wait states.

Test case supplied by: Tim Robbins <tjr@FreeBSD.ORG>
Sponsored by: DARPA, NAI Labs


# 111549 26-Feb-2003 jlemon

The TCP protocol lock may still be held if the reassembly queue dropped FIN.
Detect this case and drop the lock accordingly.

Sponsored by: DARPA, NAI Labs


# 111389 24-Feb-2003 hsu

tcp_twstart() need to be called with the TCP protocol lock held to avoid
a race condition with the TCP timer routines.


# 111386 24-Feb-2003 hsu

Pass the right function to callout_reset() for a compressed
TIME-WAIT control block.


# 111319 23-Feb-2003 jlemon

Yesterday just wasn't my day. Remove testing delta that crept into the diff.

Pointy hat provided by: sam


# 111266 22-Feb-2003 jlemon

Check to see if the TF_DELACK flag is set before returning from
tcp_input(). This unbreaks delack handling, while still preserving
correct T/TCP behavior

Tested by: maxim
Sponsored by: DARPA, NAI Labs


# 111145 19-Feb-2003 jlemon

Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs


# 111140 19-Feb-2003 jlemon

Correct comments.


# 111139 19-Feb-2003 jlemon

Clean up delayed acks and T/TCP interactions:
- delay acks for T/TCP regardless of delack setting
- fix bug where a single pass through tcp_input might not delay acks
- use callout_active() instead of callout_pending()

Sponsored by: DARPA, NAI Labs


# 110830 13-Feb-2003 hsu

The protocol lock is always held in the dropafterack case, so we don't
need to check for it at runtime.


# 110251 02-Feb-2003 cjc

Add the TCP flags to the log message whenever log_in_vain is 1, not
just when set to 2.

PR: kern/43348
MFC after: 5 days


# 109175 13-Jan-2003 hsu

Fix NewReno.

Reviewed by: Tom Henderson <thomas.r.henderson@boeing.com>


# 108464 30-Dec-2002 dillon

Remove the PAWS ack-on-ack debugging printf().

Note that the original RFC 1323 (PAWS) says in 4.2.1 that the out of
order / reverse-time-indexed packet should be acknowledged as specified
in RFC-793 page 69 then dropped. The original PAWS code in FreeBSD (1994)
simply acknowledged the segment unconditionally, which is incorrect, and
was fixed in 1.183 (2002). At the moment we do not do checks for SYN or FIN
in addition to (tlen != 0), which may or may not be correct, but the
worst that ought to happen should be a retry by the sender.


# 108123 20-Dec-2002 hsu

Unravel a nested conditional.
Remove an unneeded local variable.


# 107961 17-Dec-2002 dillon

Fix syntax in last commit.


# 107854 14-Dec-2002 dillon

Bruce forwarded this tidbit from an analysis Van Jacobson did on an
apparent ack-on-ack problem with FreeBSD. Prof. Jacobson noticed a
case in our TCP stack which would acknowledge a received ack-only packet,
which is not legal in TCP.

Submitted by: Van Jacobson <van@packetdesign.com>,
bmah@packetdesign.com (Bruce A. Mah)
MFC after: 7 days


# 106736 10-Nov-2002 sam

a better solution to building FAST_IPSEC w/o INET6

Submitted by: Jeffrey Hsu <hsu@FreeBSD.org>


# 106679 08-Nov-2002 sam

fixup FAST_IPSEC build w/o INET6


# 106271 31-Oct-2002 jeff

- Consistently update snd_wl1, snd_wl2, and rcv_up in the header
prediction code. Previously, 2GB worth of header predicted data
could leave these variables too far out of sequence which would cause
problems after receiving a packet that did not match the header
prediction.

Submitted by: Bill Baumann <bbaumann@isilon.com>
Sponsored by: Isilon Systems, Inc.
Reviewed by: hsu, pete@isilon.com, neal@isilon.com, aaronp@isilon.com


# 106198 30-Oct-2002 hsu

Don't need to check if SO_OOBINLINE is defined.
Don't need to protect isipv6 conditional with INET6.
Fix leading indentation in 2 lines.


# 105199 16-Oct-2002 sam

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 105194 16-Oct-2002 sam

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# 104226 30-Sep-2002 dillon

Guido found another bug. There is a situation with
timestamped TCP packets where FreeBSD will send DATA+FIN and
A W2K box will ack just the DATA portion. If this occurs
after FreeBSD has done a (NewReno) fast-retransmit and is
recovering it (dupacks > threshold) it triggers a case in
tcp_newreno_partial_ack() (tcp_newreno() in stable) where
tcp_output() is called with the expectation that the retransmit
timer will be reloaded. But tcp_output() falls through and
returns without doing anything, causing the persist timer to be
loaded instead. This causes the connection to hang until W2K gives up.
This occurs because in the case where only the FIN must be acked, the
'len' calculation in tcp_output() will be 0, a lot of checks will be
skipped, and the FIN check will also be skipped because it is designed
to handle FIN retransmits, not forced transmits from tcp_newreno().

The solution is to simply set TF_ACKNOW before calling tcp_output()
to absolute guarentee that it will run the send code and reset the
retransmit timer. TF_ACKNOW is already used for this purpose in other
cases.

For some unknown reason this patch also seems to greatly reduce
the number of duplicate acks received when Guido runs his tests over
a lossy network. It is quite possible that there are other
tcp_newreno{_partial_ack()} cases which were not generating the expected
output which this patch also fixes.

X-MFC after: Will be MFC'd after the freeze is over


# 103776 22-Sep-2002 silby

Fix issue where shutdown(socket, SHUT_RD) was effectively
ignored for TCP sockets.

NetBSD PR: 18185
Submitted by: Sean Boudreau <seanb@qnx.com>
MFC after: 3 days


# 103505 17-Sep-2002 dillon

Guido reported an interesting bug where an FTP connection between a
Windows 2000 box and a FreeBSD box could stall. The problem turned out
to be a timestamp reply bug in the W2K TCP stack. FreeBSD sends a
timestamp with the SYN, W2K returns a timestamp of 0 in the SYN+ACK
causing FreeBSD to calculate an insane SRTT and RTT, resulting in
a maximal retransmit timeout (60 seconds). If there is any packet
loss on the connection for the first six or so packets the retransmit
case may be hit (the window will still be too small for fast-retransmit),
causing a 60+ second pause. The W2K box gives up and closes the
connection.

This commit works around the W2K bug.

15:04:59.374588 FREEBSD.20 > W2K.1036: S 1420807004:1420807004(0) win 65535 <mss 1460,nop,wscale 2,nop,nop,timestamp 188297344 0> (DF) [tos 0x8]
15:04:59.377558 W2K.1036 > FREEBSD.20: S 4134611565:4134611565(0) ack 1420807005 win 17520 <mss 1460,nop,wscale 0,nop,nop,timestamp 0 0> (DF)

Bug reported by: Guido van Rooij <guido@gvr.org>


# 102412 25-Aug-2002 charnier

Replace various spelling with FALLTHROUGH which is lint()able


# 102131 19-Aug-2002 jmallett

Enclose IPv6 addresses in brackets when they are displayed printable with a
TCP/UDP port seperated by a colon. This is for the log_in_vain facility.

Pointed out by: Edward J. M. Brocklesby
Reviewed by: ume
MFC after: 2 weeks


# 102017 17-Aug-2002 dillon

Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit

MFC after: 1 week


# 102002 17-Aug-2002 hsu

Cosmetic-only changes for readability.

Reviewed by: (early form passed by) bde
Approved by: itojun (from core@kame.net)


# 101934 15-Aug-2002 rwatson

Rename mac_check_socket_receive() to mac_check_socket_deliver() so that
we can use the names _receive() and _send() for the receive() and send()
checks. Rename related constants, policy implementations, etc.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101928 15-Aug-2002 hsu

Reset dupack count in header prediction.
Follow-on to rev 1.39.

Reviewed by: jayanth, Thomas R Henderson <thomas.r.henderson@boeing.com>, silby, dillon


# 101106 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket. Assign labels
to newly accepted connections when the syncache/cookie code has done
its business. Also set peer labels as convenient. Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation. Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100534 22-Jul-2002 ru

Don't shrink socket buffers in tcp_mss(), application might have already
configured them with setsockopt(SO_*BUF), for RFC1323's scaled windows.

PR: kern/11966
MFC after: 1 week


# 100373 19-Jul-2002 dillon

Add the tcps_sndrexmitbad statistic, keep track of late acks that caused
unnecessary retransmissions.


# 98781 24-Jun-2002 hsu

Avoid unlocking the inp twice if badport_bandlim() returns -1.

Reported by: jlemon


# 98769 24-Jun-2002 hsu

Style bug: fix 4 space indentations that should have been tabs.

Submitted by: jlemon


# 98703 23-Jun-2002 luigi

Move two global variables to automatic variables within the
only function where they are used (they are used with TCPDEBUG only).


# 98613 22-Jun-2002 luigi

Remove (almost all) global variables that were used to hold
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.

The variables removed by this change are:

ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet

Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().

On passing, remove a bug in divert handling of fragmented packet.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.

Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.

option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.

NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.

* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor stule cleanup will likely be
necessary

* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.

* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).

MFC after: 10 days


# 98385 18-Jun-2002 tanimura

Remove so*_locked(), which were backed out by mistake.


# 98102 10-Jun-2002 hsu

Lock up inpcb.

Submitted by: Jennifer Yang <yangjihui@yahoo.com>


# 97658 31-May-2002 tanimura

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 96972 20-May-2002 tanimura

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 95883 01-May-2002 alfred

Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.


# 95759 30-Apr-2002 tanimura

Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.

Requested by: bde

Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.

While I am here, sort include files alphabetically, where possible.


# 95552 27-Apr-2002 tanimura

Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.


# 95023 19-Apr-2002 suz

just merged cosmetic changes from KAME to ease sync between KAME and FreeBSD.
(based on freebsd4-snap-20020128)

Reviewed by: ume
MFC after: 1 week


# 94390 10-Apr-2002 silby

Remove some ISN generation code which has been unused since the
syncache went in.

MFC after: 3 days


# 93085 24-Mar-2002 bde

Fixed some style bugs in the removal of __P(()). Continuation lines
were not outdented to preserve non-KNF lining up of code with parentheses.
Switch to KNF formatting.


# 92723 19-Mar-2002 alfred

Remove __P.


# 91374 27-Feb-2002 cjc

Change the wording of the inline comments from the previous commit.

Objection from: ru


# 91234 25-Feb-2002 cjc

The TCP code did not do sufficient checks on whether incoming packets
were destined for a broadcast IP address. All TCP packets with a
broadcast destination must be ignored. The system only ignored packets
that were _link-layer_ broadcasts or multicast. We need to check the
IP address too since it is quite possible for a broadcast IP address
to come in with a unicast link-layer address.

Note that the check existed prior to CSRG revision 7.35, but was
removed. This commit effectively backs out that nine-year-old change.

PR: misc/35022


# 90868 18-Feb-2002 mike

o Move NTOHL() and associated macros into <sys/param.h>. These are
deprecated in favor of the POSIX-defined lowercase variants.
o Change all occurrences of NTOHL() and associated marcros in the
source tree to use the lowercase function variants.
o Add missing license bits to sparc64's <machine/endian.h>.
Approved by: jake
o Clean up <machine/endian.h> files.
o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>.
o Remove prototypes for non-existent bswapXX() functions.
o Include <machine/endian.h> in <arpa/inet.h> to define the
POSIX-required ntohl() family of functions.
o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>,
and <sys/param.h>.
o Prepend underscores to the ntohl() family to help deal with
complexities associated with having MD (asm and inline) versions, and
having to prevent exposure of these functions in other headers that
happen to make use of endian-specific defines.
o Create weak aliases to the canonical function name to help deal with
third-party software forgetting to include an appropriate header.
o Remove some now unneeded pollution from <sys/types.h>.
o Add missing <arpa/inet.h> includes in userland.

Tested on: alpha, i386
Reviewed by: bde, jake, tmm


# 88884 04-Jan-2002 rwatson

o Spelling fix in comment: tcp_ouput -> tcp_output


# 87778 13-Dec-2001 jlemon

Fix up tabs in comments.


# 87193 02-Dec-2001 dillon

Fix a bug with transmitter restart after receiving a 0 window. The
receiver was not sending an immediate ack with delayed acks turned on
when the input buffer is drained, preventing the transmitter from
restarting immediately.

Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and
is a good idea anyway).

Some cleanup. Identify additonal issues in comments.

MFC after: 1 day


# 86764 22-Nov-2001 jlemon

Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs


# 86744 21-Nov-2001 jlemon

Move initialization of snd_recover into tcp_sendseqinit().


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 82884 03-Sep-2001 julian

Patches from Keiichi SHIMA <keiichi@iij.ad.jp>
to make ip use the standard protosw structure again.

Obtained from: Well, KAME I guess.


# 82529 29-Aug-2001 jayanth

when newreno is turned on, if dupacks = 1 or dupacks = 2 and
new data is acknowledged, reset the dupacks to 0.
The problem was spotted when a connection had its send buffer full
because the congestion window was only 1 MSS and was not being incremented
because dupacks was not reset to 0.

Obtained from: Yahoo!


# 82238 23-Aug-2001 dd

Correct a typo in a comment: FIN_WAIT2 -> FIN_WAIT_2

PR: 29970
Submitted by: Joseph Mallett <jmallett@xMach.org>


# 82122 22-Aug-2001 silby

Much delayed but now present: RFC 1948 style sequence numbers

In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.

Reviewed by: jesper


# 79413 08-Jul-2001 silby

Temporary feature: Runtime tuneable tcp initial sequence number
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.

While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.

To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.

Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.

Reviewed by: jlemon
Tested by: numerous subscribers of -net


# 78667 23-Jun-2001 ru

Add netstat(1) knob to reset net.inet.{ip|icmp|tcp|udp|igmp}.stats.
For example, ``netstat -s -p ip -z'' will show and reset IP stats.

PR: bin/17338


# 78642 23-Jun-2001 silby

Eliminate the allocation of a tcp template structure for each
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.

Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks


# 78064 11-Jun-2001 ume

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 77830 06-Jun-2001 jesper

Silby's take one on increasing FreeBSD's resistance to SYN floods:

One way we can reduce the amount of traffic we send in response to a SYN
flood is to eliminate the RST we send when removing a connection from
the listen queue. Since we are being flooded, we can assume that the
majority of connections in the queue are bogus. Our RST is unwanted
by these hosts, just as our SYN-ACK was. Genuine connection attempts
will result in hosts responding to our SYN-ACK with an ACK packet. We
will automatically return a RST response to their ACK when it gets to us
if the connection has been dropped, so the early RST doesn't serve the
genuine class of connections much. In summary, we can reduce the number
of packets we send by a factor of two without any loss in functionality
by ensuring that RST packets are not sent when dropping a connection
from the listen queue.

Submitted by: Mike Silbersack <silby@silby.com>
Reviewed by: jesper
MFC after: 2 weeks


# 77427 29-May-2001 jesper

Inline TCP_REASS() in the single location where it's used,
just as OpenBSD and NetBSD has done.

No functional difference.

MFC after: 2 weeks


# 77421 29-May-2001 jesper

properly delay acks in half-closed TCP connections

PR: 24962
Submitted by: Tony Finch <dot@dotat.at>
MFC after: 2 weeks


# 75733 20-Apr-2001 jesper

Say goodbye to TCP_COMPAT_42

Reviewed by: wollman
Requested by: wollman


# 75619 17-Apr-2001 kris

Randomize the TCP initial sequence numbers more thoroughly.

Obtained from: OpenBSD
Reviewed by: jesper, peter, -developers


# 74494 19-Mar-2001 des

Axe TCP_RESTRICT_RST. It was never a particularly good idea except for a few
very specific scenarios, and now that we have had net.inet.tcp.blackhole for
quite some time there is really no reason to use it any more.

(last of three commits)


# 73031 25-Feb-2001 jlemon

Do not delay a new ack if there already is a delayed ack pending on the
connection, but send it immediately. Prior to this change, it was possible
to delay a delayed-ack for multiple times, resulting in degraded TCP
behavior in certain corner cases.


# 72357 11-Feb-2001 bmilekic

Clean up RST ratelimiting. Previously, ratelimiting occured before tests
were performed to determine if the received packet should be reset. This
created erroneous ratelimiting and false alarms in some cases. The code
has now been reorganized so that the checks for validity come before
the call to badport_bandlim. Additionally, a few changes in the symbolic
names of the bandlim types have been made, as well as a clarification of
exactly which type each RST case falls under.

Submitted by: Mike Silbersack <silby@silby.com>


# 71594 24-Jan-2001 wollman

Correct a comment.


# 70070 15-Dec-2000 bmilekic

Change the following:

1. ICMP ECHO and TSTAMP replies are now rate limited.
2. RSTs generated due to packets sent to open and unopen ports
are now limited by seperate counters.
3. Each rate limiting queue now has its own description, as
follows:

Limiting icmp unreach response from 439 to 200 packets per second
Limiting closed port RST response from 283 to 200 packets per second
Limiting open port RST response from 18724 to 200 packets per second
Limiting icmp ping response from 211 to 200 packets per second
Limiting icmp tstamp response from 394 to 200 packets per second

Submitted by: Mike Silbersack <silby@silby.com>


# 69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 68318 04-Nov-2000 jlemon

tp->snd_recover is part of the New Reno recovery algorithm, and should
only be checked if the system is currently performing New Reno style
fast recovery. However, this value was being checked regardless of the
NR state, with the end result being that the congestion window was never
opened.

Change the logic to check t_dupack instead; the only code path that
allows it to be nonzero at this point is NewReno, so if it is nonzero,
we are in fast recovery mode and should not touch the congestion window.

Tested by: phk


# 63745 21-Jul-2000 jayanth

When a connection is being dropped due to a listen queue overflow,
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.

Reviewed by: Peter Wemm,asmodai


# 62846 09-Jul-2000 itojun

be more cautious about tcp option length field. drop bogus ones earlier.
not sure if there is a real threat or not, but it seems that there's
possibility for overrun/underrun (like non-NOP option with optlen > cnt).


# 62587 04-Jul-2000 itojun

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 60798 22-May-2000 dan

sysctl'ize ICMP_BANDLIM and ICMP_BANDLIM_SUPPRESS_OUTPUT.

Suggested by: des/nbm


# 60687 18-May-2000 jayanth

snd_cwnd was updated twice in the tcp_newreno function.


# 60662 17-May-2000 jayanth

Sigh, fix a rookie patch merge error.

Also-missed-by: peter


# 60619 16-May-2000 jayanth

snd_una was being updated incorrectly, this resulted in the newreno
code retransmitting data from the wrong offset.

As a footnote, the newreno code was partially derived from NetBSD
and Tom Henderson <tomh@cs.berkeley.edu>


# 60067 06-May-2000 jlemon

Implement TCP NewReno, as documented in RFC 2582. This allows
better recovery for multiple packet losses in a single window.
The algorithm can be toggled via the sysctl net.inet.tcp.newreno,
which defaults to "on".

Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>


# 59334 17-Apr-2000 sumikawa

ND6_HINT() should not be called unless the connection status is
ESTABLISHED.

Obtained from: KAME Project


# 58907 01-Apr-2000 shin

Support per socket based IPv4 mapped IPv6 addr enable/disable control.

Submitted by: ume


# 58698 27-Mar-2000 jlemon

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


# 57903 11-Mar-2000 shin

IPv6 6to4 support.

Now most big problem of IPv6 is getting IPv6 address
assignment.
6to4 solve the problem. 6to4 addr is defined like below,

2002: 4byte v4 addr : 2byte SLA ID : 8byte interface ID

The most important point of the address format is that an IPv4 addr
is embeded in it. So any user who has IPv4 addr can get IPv6 address
block with 2byte subnet space. Also, the IPv4 addr is used for
semi-automatic IPv6 over IPv4 tunneling.

With 6to4, getting IPv6 addr become dramatically easy.
The attached patch enable 6to4 extension, and confirmed to work,
between "Richard Seaman, Jr." <dick@tar.com> and me.

Approved by: jkh

Reviewed by: itojun


# 56724 28-Jan-2000 imp

Mitigate the stream.c attacks

o Drop all broadcast and multicast source addresses in tcp_input.
o Enable ICMP_BANDLIM in GENERIC.
o Change default to 200/s from 100/s. This will still stop the attack, but
is conservative enough to do this close to code freeze.

This is not the optimal patch for the problem, but is likely the least
intrusive patch that can be made for this.

Obtained from: Don Lewis and Matt Dillon.
Reviewed by: freebsd-security


# 56565 25-Jan-2000 shin

Avoid m_len and m_pkthdr.len inconsistency when changing m_len
for an mbuf whose M_PKTHDR is set.

PR: related to kern/15175
Reviewed by: archie


# 56041 15-Jan-2000 shin

Fixed the problem that IPsec connection hangs when bigger data is sent.
-opt_ipsec.h was missing on some tcp files (sorry for basic mistake)
-made buildable as above fix
-also added some missing IPv4 mapped IPv6 addr consideration into
ipsec4_getpolicybysock


# 55875 13-Jan-2000 shin

add a comment for some possible? IPv4 option processing.


# 55679 09-Jan-2000 shin

tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 55009 22-Dec-1999 shin

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 54601 14-Dec-1999 jlemon

Use SEQ_* macros for comparing sequence space numbers.

Reviewed by: truckman


# 54421 11-Dec-1999 jlemon

According to RFC 793, a reset should be honored if the sequence number
is within the receive window. Follow this behavior, instead of only
allowing resets at last_ack_sent.

Pointed out by: jayanth@yahoo-inc.com


# 54263 07-Dec-1999 shin

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translater daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 52070 09-Oct-1999 green

Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.


# 51279 14-Sep-1999 des

Fix some more disordering, as well as the description string for the
net.inet.tcp.drop_synfin sysctl, which for some mysterious reason said
"Drop TCP packets with FIN+ACK set" (instead of "...with SYN+FIN set")


# 51209 12-Sep-1999 des

Add the net.inet.tcp.restrict_rst and net.inet.tcp.drop_synfin sysctl
variables, conditional on the TCP_RESTRICT_RST and TCP_DROP_SYNFIN kernel
options, respectively. See the comments in LINT for details.


# 50673 30-Aug-1999 jlemon

Restructure TCP timeout handling:

- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.

Submitted by: jlemon, wollmann


# 50596 29-Aug-1999 obrien

Remove extra indenting of `break' statements introducted in rev 1.89,
plus wrap some long lines from that revision.

While here, wrap some other long lines.


# 50477 28-Aug-1999 peter

$Id$ -> $FreeBSD$


# 50043 19-Aug-1999 csgr

Fix breakage if blackhole=1 and tiflags & TH_SYN, plus
style(9) fixes

Submitted by: Jonathon Lemon


# 50015 18-Aug-1999 csgr

Slight tweak to tcp.blackhole to add optional behaviour to
drop any segment arriving at a closed port.
tcp.blackhole=1 - only drop SYN without RST
tcp.blackhole=2 - drop everything without RST
tcp.blackhole=0 - always send RST - default behaviour

This confuses nmap -sF or -sX or -sN quite badly.


# 49968 17-Aug-1999 csgr

Add net.inet.tcp.blackhole and net.inet.udp.blackhole
sysctl knobs.

With these knobs on, refused connection attempts are dropped
without sending a RST, or Port unreachable in the UDP case.
In the TCP case, sending of RST is inhibited iff the incoming
segment was a SYN.

Docs and rc.conf settings to follow.


# 48886 18-Jul-1999 jmb

fix comment re: RST received in TIME_WAIT to match the code.


# 46568 06-May-1999 peter

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


# 46381 03-May-1999 billf

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 43691 06-Feb-1999 fenner

Use snd_nxt, not rcv_nxt, when calculating the ISS during TIME_WAIT.
This was missed in the 4.4-Lite2 merge.

Noticed by: Mohan Parthasarathy <Mohan.Parthasarathy@eng.Sun.COM> and
jayanth@loc201.tandem.com (vijayaraghavan_jayanth)
on the tcp-impl mailing list.


# 43305 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 41487 03-Dec-1998 dillon

Reviewed by: freebsd-current

Add ICMP_BANDLIM option and 'net.inet.icmp.icmplim' sysctl. If option
is specified in kernel config, icmplim defaults to 100 pps. Setting it
to 0 will disable the feature. This feature limits ICMP error responses
for packets sent to bad tcp or udp ports, which does a lot to help the
machine handle network D.O.S. attacks.

The kernel will report packet rates that exceed the limit at a rate of
one kernel printf per second. There is one issue in regards to the
'tail end' of an attack... the kernel will not output the last report
until some unrelated and valid icmp error packet is return at some
point after the attack is over. This is a minor reporting issue only.


# 39078 11-Sep-1998 wollman

Fix RST validation.

PR: 7892
Submitted by: Don.Lewis@tsc.tdk.com


# 38513 24-Aug-1998 dfr

Re-implement tcp and ip fragment reassembly to not store pointers in the
ip header which can't work on alpha since pointers are too big.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


# 37409 06-Jul-1998 julian

Support for IPFW based transparent forwarding.
Any packet that can be matched by a ipfw rule can be redirected
transparently to another port or machine. Redirection to another port
mostly makes sense with tcp, where a session can be set up
between a proxy and an unsuspecting client. Redirection to another machine
requires that the other machine also be expecting to receive the forwarded
packets, as their headers will not have been modified.

/sbin/ipfw must be recompiled!!!

Reviewed by: Peter Wemm <peter@freebsd.org>
Submitted by: Chrisy Luke <chrisy@flix.net>


# 36529 31-May-1998 peter

Let the sowwakeup macro decide when to call sowakeup rather than have
tcp "know" about it. A pending upcall would be missed, eg: used by NFS.

Obtained from: NetBSD


# 36161 18-May-1998 guido

Grumble...It seems I'm suffering from some mental disease. Do it correct now.


# 36159 18-May-1998 guido

Add some parenthesis for clarity and fix a bug
Pointed out by: Garrett Wollmand


# 35698 04-May-1998 guido

Refuse accellerated opens on listening sockets that have not set
the TCP_NOPUSH socket option.
This disables TAO for those services that do not know about T/TCP.

Reviewed by: Garrett Wollman
Submitted by: Peter Wemm


# 35421 24-Apr-1998 dg

At the request of Garrett, changed sysctl:

net.inet.tcp.delack_enabled -> net.inet.tcp.delayed_ack


# 35256 17-Apr-1998 des

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


# 35056 06-Apr-1998 phk

Remove the last traces of TUBA.

Inspired by: PR kern/3317


# 34697 20-Mar-1998 fenner

Remove the check for SYN in SYN_RECEIVED state; it breaks simultaneous
connect. This check was added as part of the defense against the "land"
attack, to prevent attacks which guess the ISS from going into ESTABLISHED.
The "src == dst" check will still prevent the single-homed case of the
"land" attack, and guessing ISS's should be hard anyway.

Submitted by: David Borman <dab@bsdi.com>


# 33846 26-Feb-1998 dg

Changes to support the addition of a new sysctl variable:
net.inet.tcp.delack_enabled
Which defaults to 1 and can be set to 0 to disable TCP delayed-ack
processing (i.e. all acks are immediate).


# 32821 27-Jan-1998 dg

Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).


# 32662 21-Jan-1998 fenner

A more complete fix for the "land" attack, removing the "quick fix" from
rev 1.66. This fix contains both belt and suspenders.

Belt: ignore packets where src == dst and srcport == dstport in TCPS_LISTEN.
These packets can only legitimately occur when connecting a socket to itself,
which doesn't go through TCPS_LISTEN (it goes CLOSED->SYN_SENT->SYN_RCVD->
ESTABLISHED). This prevents the "standard" "land" attack, although doesn't
prevent the multi-homed variation.

Suspenders: send a RST in response to a SYN/ACK in SYN_RECEIVED state.
The only packets we should get in SYN_RECEIVED are
1. A retransmitted SYN, or
2. An ack of our SYN/ACK.
The "land" attack depends on us accepting our own SYN/ACK as an ACK;
in SYN_RECEIVED state; this should prevent all "land" attacks.

We also move up the sequence number check for the ACK in SYN_RECEIVED.
This neither helps nor hurts with respect to the "land" attack, but
puts more of the validation checking in one spot.

PR: kern/5103


# 31882 19-Dec-1997 bde

Don't use ANSI string concatenation to misformat a string.


# 31323 20-Nov-1997 wollman

Add Matt Dillon's quick fix hack for the self-connect DoS.

PR: 5103


# 31016 07-Nov-1997 phk

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


# 30813 28-Oct-1997 bde

Removed unused #includes.


# 30052 02-Oct-1997 dg

Killed the SYN_RECEIVED addition from rev 1.52. It results in legitimate
RST's being ignored, keeping a connection around until it times out, and
thus has the opposite effect of what was intended (which is to make the
system more robust to DoS attacks).


# 30005 30-Sep-1997 fenner

Don't consider a SYN/ACK with CC but no CCECHO a proper T/TCP
handshake.

Reviewed by: Rich Stevens <rstevens@kohala.com>


# 29514 16-Sep-1997 joerg

Make TCPDEBUG a new-style option.


# 28270 16-Aug-1997 wollman

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 27135 01-Jul-1997 jdp

Fix a bug (apparently very old) that can cause a TCP connection to
be dropped when it has an unusual traffic pattern. For full details
as well as a test case that demonstrates the failure, see the
referenced PR.

Under certain circumstances involving the persist state, it is
possible for the receive side's tp->rcv_nxt to advance beyond its
tp->rcv_adv. This causes (tp->rcv_adv - tp->rcv_nxt) to become
negative. However, in the code affected by this fix, that difference
was interpreted as an unsigned number by max(). Since it was
negative, it was taken as a huge unsigned number. The effect was
to cause the receiver to believe that its receive window had negative
size, thereby rejecting all received segments including ACKs. As
the test case shows, this led to fruitless retransmissions and
eventually to a dropped connection. Even connections using the
loopback interface could be dropped. The fix substitutes the signed
imax() for the unsigned max() function.

PR: closes kern/3998
Reviewed by: davidg, fenner, wollman


# 25201 27-Apr-1997 wollman

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 19597 10-Nov-1996 fenner

Re-enable the TCP SYN-attack protection code. I was the one who didn't
understand the socket state flag.

2.2 candidate.


# 18874 11-Oct-1996 pst

Fix two bugs I accidently put into the syn code at the last minute
(yes I had tested the hell out of this).

I've also temporarily disabled the code so that it behaves as it previously
did (tail drop's the syns) pending discussion with fenner about some socket
state flags that I don't fully understand.

Submitted by: fenner


# 18795 07-Oct-1996 dg

Improved in_pcblookuphash() to support wildcarding, and changed relavent
callers of it to take advantage of this. This reduces new connection
request overhead in the face of a large number of PCBs in the system.
Thanks to David Filo <filo@yahoo.com> for suggesting this and providing
a sample implementation (which wasn't used, but showed that it could be
done).

Reviewed by: wollman


# 18787 07-Oct-1996 pst

Increase robustness of FreeBSD against high-rate connection attempt
denial of service attacks.

Reviewed by: bde,wollman,olah
Inspired by: vjs@sgi.com


# 18437 21-Sep-1996 pst

I don't understand, I committed this fix (move a counter and fixed a typo)
this evening.

I think I'm going insane.


# 18436 21-Sep-1996 ache

Syntax error: so_incom -> so_incomp


# 18431 20-Sep-1996 pst

If the incomplete listen queue for a given socket is full,
drop the oldest entry in the queue.

There was a fair bit of discussion as to whether or not the
proper action is to drop a random entry in the queue. It's
my conclusion that a random drop is better than a head drop,
however profiling this section of code (done by John Capo)
shows that a head-drop results in a significant performance
increase.

There are scenarios where a random drop is more appropriate.
If I find one in reality, I'll add the random drop code under
a conditional.

Obtained from: discussions and code done by Vernon Schryver (vjs@sgi.com).


# 18280 13-Sep-1996 pst

Make the misnamed tcp initial keepalive timer value (which is really the
time, in seconds, that state for non-established TCP sessions stays about)
a sysctl modifyable variable.

[part 1 of two commits, I just realized I can't play with the indices as
I was typing this commit message.]


# 18278 13-Sep-1996 pst

Receipt of two SYN's are sufficient to set the t_timer[TCPT_KEEP]
to "keepidle". this should not occur unless the connection has
been established via the 3-way handshake which requires an ACK

Submitted by: jmb
Obtained from: problem discussed in Stevens vol. 3


# 15525 02-May-1996 fenner

Back out my stupid braino; I was thinking strlen and not sizeof.


# 15524 02-May-1996 fenner

Size temp var correctly; buf[4*sizeof "123"] is not long enough
to store "192.252.119.189\0".


# 15414 27-Apr-1996 ache

inet_ntoa buffer was evaluated twice in log_in_vain, fix it.
Thanx to: jdp


# 15396 26-Apr-1996 wollman

Delete #ifdef notdef blocks containing old method of srtt calculation.

Requested by: davidg


# 15154 09-Apr-1996 pst

Logging UDP and TCP connection attempts should not be enabled by default.
It's trivial to create a denial of service attack on a box so enabled.

These messages, if enabled at all, must be rate-limited. (!)


# 15038 04-Apr-1996 phk

Log TCP syn packets for ports we don't listen on.
Controlled by: sysctl net.inet.tcp.log_in_vain: 1

Log UDP syn packets for ports we don't listen on.
Controlled by: sysctl net.inet.udp.log_in_vain: 1

Suggested by: Warren Toomey <wkt@cs.adfa.oz.au>


# 14819 25-Mar-1996 wollman

Slight modification of RTO floor calculation.


# 14753 22-Mar-1996 wollman

A number of performance-reducing flaws fixed based on comments
from Larry Peterson &co. at Arizona:

- Header prediction for ACKs did not exclude Fast Retransmit/Recovery.
- srtt calculation tended to get ``stuck'' and could never decrease
when below 8. It still can't, but the scaling factors are adjusted
so that this artifact does not cause as bad an effect on the RTO
value as it used to.

The paper also points out the incr/8 error that has been long since fixed,
and the problems with ACKing frequency resulting from the use of options
which I suspect to be fixed already as well (as part of the T/TCP work).

Obtained from: Brakmo & Peterson, ``Performance Problems in BSD4.4 TCP''


# 14546 11-Mar-1996 dg

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# 14268 26-Feb-1996 guido

Add a counter for the number of times the listen queue was overflowed to
the tcpstat structure. (netstat -s)
Reviewed by: wollman
Obtained from: Steves, TCP/IP Ill. vol.3, page 189


# 14181 22-Feb-1996 dg

Fixed bug in Path MTU Discovery that caused the system to have to re-
discover the Path MTU for each connection if the connecting host didn't
offer an initial MSS.

Submitted by: davidg & olah


# 13779 31-Jan-1996 olah

Fix a bug related to the interworking of T/TCP and window scaling:
when a connection enters the ESTBLS state using T/TCP, then window
scaling wasn't properly handled. The fix is twofold.

1) When the 3WHS completes, make sure that we update our window
scaling state variables.

2) When setting the `virtual advertized window', then make sure
that we do not try to offer a window that is larger than the maximum
window without scaling (TCP_MAXWIN).

Reviewed by: davidg
Reported by: Jerry Chen <chen@Ipsilon.COM>


# 12820 14-Dec-1995 phk

Another mega commit to staticize things.


# 12296 14-Nov-1995 phk

New style sysctl & staticize alot of stuff.


# 12172 09-Nov-1995 phk

Start adding new style sysctl here too.


# 12047 03-Nov-1995 olah

Cosmetic changes to processing of segments in the SYN_SENT state:
- remove a redundant condition;
- complete all validity checks on segment before calling
soisconnected(so).

Reviewed by: Richard Stevens, davidg, wollman


# 11458 13-Oct-1995 wollman

Routes can be asymmetric. Always offer to /accept/ an MSS of up to the
capacity of the link, even if the route's MTU indicates that we cannot
send that much in their direction. (This might actually make it possible
to test Path MTU discovery in a useful variety of cases.)


# 11150 03-Oct-1995 wollman

Finish 4.4-Lite-2 merge: randomize TCP initial sequence numbers
to make ISS-guessing spoofing attacks harder.


# 9818 31-Jul-1995 olah

Remove a redundant `if' from tcp_reass().

Correct a typo in a comment (SEND_SYN -> NEEDSYN).

Reviewed by: David Greenman


# 9470 10-Jul-1995 wollman

tcp_input.c - keep track of how many times a route contained a cached rtt
or ssthresh that we were able to use

tcp_var.h - declare tcpstat entries for above; declare tcp_{send,recv}space

in_rmx.c - fill in the MTU and pipe sizes with the defaults TCP would have
used anyway in the absence of values here


# 9373 29-Jun-1995 wollman

Keep track of the number of samples through the srtt filter so that we
know better when to cache values in the route, rather than relying on a
heuristic involving sequence numbers that broke when tcp_sendspace
was increased to 16k.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8429 11-May-1995 dg

#ifdef'd my Nagel/ACK hack with "TCP_ACK_HACK", disabled by default. I'm
currently considering reducing the TCP fasttimo to 100ms to help improve
things, but this would be done as a seperate step at some point in the
future.
This was done because it was causing some sometimes serious performance
problems with T/TCP.


# 8377 09-May-1995 olah

Fix a misspelled constant in tcp_input.c.

On Tue, 09 May 1995 04:35:27 PDT, Richard Stevens wrote:
> In tcp_dooptions() under the case TCPOPT_CC there is an assignment
>
> to->to_flag |= TCPOPT_CC;
>
> that should be
>
> to->to_flag |= TOF_CC;
>
> I haven't thought through the ramifications of what's been happening ...
>
> Rich Stevens

Submitted by: rstevens@noao.edu (Richard Stevens)


# 8235 03-May-1995 dg

Changed in_pcblookuphash() to not automatically call in_pcblookup() if
the lookup fails. Updated callers to deal with this. Call in_pcblookuphash
instead of in_pcblookup() in in_pcbconnect; this improves performance of
UDP output by about 17% in the standard case.


# 7738 10-Apr-1995 dg

Further satisfy my paranoia by making sure that the ACKNOW is only
set when ti_len is non-zero.


# 7737 10-Apr-1995 dg

Fixed bug I introduced with my Nagel hack which caused tcp_input and
tcp_output to loop endlessly. This was freefall's problem during the past
day.


# 7684 09-Apr-1995 dg

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# 7634 05-Apr-1995 olah

Fix a bug in tcp_input reported by Rick Jones <raj@hpisrdq.cup.hp.com>.

If a goto findpcb occurred during the processing of a segment, the TCP and
IP headers were dropped twice from the mbuf which resulted in data acked
by TCP but not delivered to the user.
Reviewed by: davidg


# 7417 27-Mar-1995 dg

Re-apply my "breakage" to the Nagel congestion avoidence. This version
differs slightly in the logic from the previous version; packets are now
acked immediately if the sender set PUSH.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 6480 16-Feb-1995 wollman

Avoid deadlock situation described by Stevens using his suggested replacement
code.

Obtained from: Stevens, vol. 2, pp. 959-960


# 6475 16-Feb-1995 wollman

Transaction TCP support now standard. Hack away!


# 6363 14-Feb-1995 phk

YFfix.


# 6348 14-Feb-1995 wollman

Get rid of some unneeded #ifdef TTCP lines. Also, get rid of some
bogus commons declared in header files.


# 6283 09-Feb-1995 wollman

Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and
Bob Braden <braden@isi.edu>.

NB: This has not had David's TCP ACK hack re-integrated. It is not clear
what the correct solution to this problem is, if any. If a better solution
doesn't pop up in response to this message, I'll put David's code back in
(or he's welcome to do so himself).


# 3561 13-Oct-1994 wollman

As suggested by Sally Floyd, don't add the ``small fraction of the window
size'' when doing congestion avoidance.

Submitted by: Mark Andrews


# 3311 02-Oct-1994 phk

GCC cleanup.
Reviewed by:
Submitted by:
Obtained from:


# 2788 15-Sep-1994 dg

Made TCPDEBUG truely optional. Based on changes I made in FreeBSD 1.1.5.
Fixed somebody's idea of a joke - about the first half of the lines in
in_proto.c were spaced over by one space.


# 2304 26-Aug-1994 wollman

Obey RFC 793, section 3.4:

Several examples of connection initiation follow. Although these
examples do not show connection synchronization using data-carrying
segments, this is perfectly legitimate, so long as the receiving TCP
doesn't deliver the data to the user until it is clear the data is
valid (i.e., the data must be buffered at the receiver until the
connection reaches the ESTABLISHED state).


# 2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 1817 02-Aug-1994 dg

Added $Id$


# 1812 01-Aug-1994 dg

Fixed bug with Nagel Congestion Avoidance where a tcp connection would
stall unnecessarily - always send an ACK when a packet len of < mss is
received.


# 1565 26-May-1994 dg

Added missing ntohl()'s that are needed before calling IN_MULTICAST in
a couple of places.
Submitted by: Johannes Helander


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources