History log of /openbsd-current/sys/netinet/tcp_subr.c
Revision Date Author Comments
# 1.201 17-Apr-2024 bluhm

Use struct ipsec_level within inpcb.

Instead of passing around u_char[4], introduce struct ipsec_level
that contains 4 ipsec levels. This provides better type safety.
The embedding struct inpcb is globally visible for netstat(1), so
put struct ipsec_level outside of #ifdef _KERNEL.

OK deraadt@ mvs@


# 1.200 12-Apr-2024 bluhm

Split single TCP inpcb table into IPv4 and IPv6 parts.

With two separate TCP hash tables, each one becomes smaller. When
we remove the exclusive net lock from TCP, contention on internet
PCB table mutex will be reduced. UDP has been split earlier into
IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with
assertions.

OK mvs@


Revision tags: OPENBSD_7_5_BASE
# 1.199 13-Feb-2024 bluhm

Merge struct route and struct route_in6.

Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embedded there.

OK claudio@


# 1.198 11-Feb-2024 bluhm

Remove include netinet6/ip6_var.h from netinet/in_pcb.h.

OK mvs@


# 1.197 28-Jan-2024 bluhm

Use more specific sockaddr type for inpcb notify.

in_pcbnotifyall() is an IPv4 only function. All callers check that
sockaddr dst is in fact a sockaddr_in. Pass the more specific type
and remove the runtime check at beginning of in_pcbnotifyall().
Use const sockaddr_in in in_pcbnotifyall() and const sockaddr_in6
in6_pcbnotify() as dst parameter.

OK millert@


# 1.196 27-Jan-2024 bluhm

Declare address parameter in TCP SYN cache const.

tcp6_ctlinput() cast a constant sockaddr_in6 to non-const sockaddr.
sa6_src may be &sa6_any which lives in read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.

OK mvs@


# 1.195 11-Jan-2024 bluhm

Fix white spaces in TCP.


# 1.194 29-Nov-2023 bluhm

Document inp_socket as immutable and remove NULL checks.

Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.

OK mvs@


# 1.193 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.192 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
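
The wrap-around figures quoted above can be checked with a few lines of C. This is only an illustrative sketch, not code from tcp_subr.c; the helper names wrap_days(), ts_value() and ts_elapsed() are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MS_PER_DAY	(1000ULL * 60 * 60 * 24)

/* Whole days until a millisecond counter of the given width wraps. */
static uint64_t
wrap_days(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	if (bits == 64)
		return UINT64_MAX / MS_PER_DAY;
	return (1ULL << bits) / MS_PER_DAY;
}

/* The 32 bit timestamp option carries only the low half of the counter. */
static uint32_t
ts_value(uint64_t now_ms)
{
	return (uint32_t)now_ms;
}

/* Unsigned subtraction keeps the elapsed time correct across a wrap. */
static uint32_t
ts_elapsed(uint32_t now, uint32_t then)
{
	return now - then;
}
```

wrap_days(32) yields 49 days, matching the commit message, and wrap_days(63) divided by 365 lands in the 2.9*10^8 year range.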


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec need
the software fallback in general.
For performance comparison or to work around possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers. In such cases the window size had been fixed at
16KB, which limits throughput to a very low rate on high latency
networks. Also change "tcp_now" from a 2HZ tick counter to binuptime
in milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCBs need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested by Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
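
The ordering problem described above can be sketched in C. This is a toy model under stated assumptions, not the actual tcp_mtudisc(): struct toy_tcpcb, toy_mss_update() and toy_mtudisc() are hypothetical names, and the fixed 40 byte header overhead is an assumption for the example.

```c
/* Hypothetical stand-in for the relevant part of a TCP control block. */
struct toy_tcpcb {
	unsigned int	t_maxseg;	/* current maximum segment size */
	int		resent;		/* how often a resend was triggered */
};

/* Stand-in for the MSS update: shrink t_maxseg to fit the path MTU. */
static void
toy_mss_update(struct toy_tcpcb *tp, unsigned int path_mtu)
{
	unsigned int mss;

	if (path_mtu <= 40)
		return;
	mss = path_mtu - 40;	/* assumed IPv4 + TCP header size */
	if (mss < tp->t_maxseg)
		tp->t_maxseg = mss;
}

/* Returns 1 if the segment size shrank and a resend was triggered. */
static int
toy_mtudisc(struct toy_tcpcb *tp, unsigned int path_mtu)
{
	unsigned int orig_maxseg = tp->t_maxseg;

	toy_mss_update(tp, path_mtu);
	/* Compare only after the update, and only resend on a decrease. */
	if (orig_maxseg > tp->t_maxseg) {
		tp->resent++;
		return 1;
	}
	return 0;
}
```

Saving the old value first and comparing after the update is the point of the fix. Comparing before the update would always see the unchanged value.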


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
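
The 75% rule above can be sketched as a load-factor check. This is a minimal illustration, not the in_pcb.c implementation: struct toy_table and both function names are hypothetical, and the rehashing of existing entries is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table bookkeeping; the field names are illustrative. */
struct toy_table {
	size_t	size;	/* number of hash buckets */
	size_t	count;	/* number of entries stored */
};

/* True once the entries reach 75% of the table size. */
static int
toy_needs_resize(const struct toy_table *t)
{
	/* count / size >= 3 / 4, written without division */
	return t->count * 4 >= t->size * 3;
}

/* Account for one inserted entry and double the table when it is due. */
static void
toy_insert(struct toy_table *t)
{
	assert(t->size > 0);
	t->count++;
	if (toy_needs_resize(t))
		t->size *= 2;	/* a real table would also rehash its entries */
}
```

With a table of 128 buckets, the first 95 insertions leave the size alone and the 96th doubles it to 256.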


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
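
For reference, the two initial window formulas can be compared in C. The sketch follows the formulas published in RFC 3390 and in the initcwnd draft (later RFC 6928); it is not taken from the OpenBSD sources, and the function names are made up for the example.

```c
static unsigned int
umin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int
umax(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)) */
static unsigned int
iw_rfc3390(unsigned int mss)
{
	return umin(4 * mss, umax(2 * mss, 4380));
}

/* draft-ietf-tcpm-initcwnd: IW = min(10*MSS, max(2*MSS, 14600 bytes)) */
static unsigned int
iw_initcwnd(unsigned int mss)
{
	return umin(10 * mss, umax(2 * mss, 14600));
}
```

For an MSS of 1460 bytes the draft formula gives the 14600 bytes mentioned above, while the RFC 3390 formula is capped at 4380 bytes. For a small MSS such as 536 bytes the results are plain 10*MSS and 4*MSS.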


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
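
The problem described in 1.65 can be sketched in plain C. In a variadic call
the compiler cannot know that a trailing argument is meant to be a pointer, so
a bare NULL that is defined as the integer 0 may be passed as an int. The
helper below, count_strings(), is hypothetical and not taken from tcp_subr.c;
it only illustrates the sentinel idiom with the explicit (void *) cast.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the kernel): counts the string
 * arguments that precede a null-pointer sentinel.
 */
static int
count_strings(const char *first, ...)
{
	va_list ap;
	const char *s;
	int n = 0;

	va_start(ap, first);
	/* Each va_arg() reads a full pointer-sized argument. */
	for (s = first; s != NULL; s = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A call such as count_strings("a", "b", (void *)NULL) passes a full-width null
pointer as the sentinel, whereas an uncast NULL could be passed as a narrower
int on a 64 bit system.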


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seems to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
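
As a rough illustration of the overflow guarded against in 1.24: the TCP
header carries the window in a 16 bit unsigned field, so the value left after
applying the window scale has to be clamped before it is stored. The function
window_field() and the local MAX_WINDOW_FIELD constant are assumptions made
for this sketch, not the code from tcp_subr.c.

```c
#include <stdint.h>

/* Largest value the 16 bit window field can carry (local to this sketch). */
#define MAX_WINDOW_FIELD 65535U

/*
 * Hypothetical helper: converts a receive window in bytes into the
 * value placed in the 16 bit header field, given the window scale shift.
 */
static uint16_t
window_field(uint32_t win, unsigned int scale)
{
	uint32_t scaled = win >> scale;

	/* Clamp instead of letting the value truncate to 16 bits. */
	if (scaled > MAX_WINDOW_FIELD)
		scaled = MAX_WINDOW_FIELD;
	return (uint16_t)scaled;
}
```

Without the clamp, a 100000 byte window with scale 0 would be truncated to
100000 mod 65536 = 34464 rather than advertised as the 65535 maximum.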


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.195 11-Jan-2024 bluhm

Fix white spaces in TCP.


# 1.194 29-Nov-2023 bluhm

Document inp_socket as immutable and remove NULL checks.

Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.

OK mvs@


# 1.193 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.192 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
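
The figures quoted in 1.192 can be checked with a little arithmetic. The
sketch below assumes a counter that ticks once per millisecond; wrap_days()
and ts_low32() are hypothetical names used only for this illustration and do
not appear in the kernel.

```c
#include <stdint.h>

#define MS_PER_DAY (1000ULL * 60 * 60 * 24)

/*
 * Whole days until a millisecond counter with the given number of
 * bits (1..63) wraps around.
 */
static uint64_t
wrap_days(unsigned int bits)
{
	return ((uint64_t)1 << bits) / MS_PER_DAY;
}

/*
 * The TCP timestamp option is 32 bits wide, so only the lower
 * 32 bits of the 64 bit counter are used there.
 */
static uint32_t
ts_low32(uint64_t now_ms)
{
	return (uint32_t)now_ms;
}
```

With these definitions wrap_days(32) gives 49 days, matching the commit
message, and wrap_days(63) divided by 365 gives roughly 2.9*10^8 years.
Differences of two ts_low32() values stay correct across a wrap because
unsigned 32 bit subtraction is modular.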


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers; in such cases the window size had been fixed at
16KB, which limits throughput severely on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timer.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Input from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
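
The resize rule from 1.131 can be sketched as a small predicate. This is an
assumption-laden illustration, not the in_pcb code: next_table_size() is a
hypothetical helper, and the table size is assumed to be a power of two of at
least 4 so that the 75% threshold is an exact integer.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns the table size to use after an insert
 * brings the number of entries to `count`. The size doubles once the
 * entry count reaches 75% of the current size.
 */
static size_t
next_table_size(size_t size, size_t count)
{
	size_t threshold = (size / 4) * 3;	/* 75% of size */

	if (count >= threshold)
		return size * 2;
	return size;
}
```

For a table of 128 buckets the threshold is 96 entries, so the 96th entry
triggers growth to 256 buckets. After doubling, the existing entries would
have to be rehashed into the larger table, which this sketch leaves out.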


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it cast away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
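
As a rough illustration of the two window sizes mentioned above, the sketch below uses the formulas published in RFC 3390 and in RFC 6928 (the final form of draft-ietf-tcpm-initcwnd). The function name initial_window() and its mode argument are hypothetical; the mode values only mirror the sysctl settings 1 and 2 named in the message, and this is not claimed to be the kernel's implementation.

```c
#include <stdint.h>

static uint32_t
min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static uint32_t
max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical helper: initial congestion window in bytes.
 * mode 2: min(10*MSS, max(2*MSS, 14600))   (RFC 6928)
 * other:  min(4*MSS,  max(2*MSS, 4380))    (RFC 3390)
 */
static uint32_t
initial_window(uint32_t mss, int mode)
{
	if (mode == 2)
		return min_u32(10 * mss, max_u32(2 * mss, 14600));
	return min_u32(4 * mss, max_u32(2 * mss, 4380));
}
```

Because of the byte caps in both formulas, the result is 10*MSS or 4*MSS only for small segments; with an MSS of 1460 the larger mode yields exactly 14600 bytes.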


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and therefore fail
to work. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
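
The sequence-space checks listed above rely on wraparound-safe comparisons. The sketch below shows the usual signed-difference idiom; the macro definitions are written out here for illustration, and the helpers ack_in_window() and seq_min() are hypothetical names, not the kernel's functions.

```c
#include <stdint.h>

/*
 * Compare 32-bit sequence numbers modulo 2^32: the sign of the
 * difference tells which value is "earlier", even across wraparound.
 */
#define SEQ_LT(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)
#define SEQ_LEQ(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)) <= 0)

/* Hypothetical check: is ack within [snd_una, snd_max]? */
static int
ack_in_window(uint32_t ack, uint32_t snd_una, uint32_t snd_max)
{
	return SEQ_LEQ(snd_una, ack) && SEQ_LEQ(ack, snd_max);
}

/* Hypothetical helper: the earlier of two sequence numbers. */
static uint32_t
seq_min(uint32_t a, uint32_t b)
{
	return SEQ_LT(a, b) ? a : b;
}
```

A plain unsigned comparison would misorder values on either side of the 2^32 wrap, which is why the difference is cast to a signed type first.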


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
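
A minimal sketch of the kind of clamp this message describes, assuming a window scale shift of at most 14 (the maximum the TCP window scale option permits). The function name scaled_window_field() is hypothetical and the code is illustrative rather than the actual fix.

```c
#include <stdint.h>

#define TCP_MAXWIN	65535u	/* largest value the 16-bit window field holds */

/*
 * Hypothetical helper: compute the 16-bit window field for a receive
 * window of `win` bytes and a window scale shift of `scale`
 * (assumed <= 14). The window is clamped first so the shifted
 * result always fits in 16 bits.
 */
static uint16_t
scaled_window_field(uint32_t win, unsigned int scale)
{
	uint32_t limit = (uint32_t)TCP_MAXWIN << scale;

	if (win > limit)
		win = limit;
	return (uint16_t)(win >> scale);
}
```

Without the clamp, a window larger than 65535 << scale would be truncated when stored in the header field and advertise a much smaller window than intended.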


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.195 11-Jan-2024 bluhm

Fix white spaces in TCP.


# 1.194 29-Nov-2023 bluhm

Document inp_socket as immutable and remove NULL checks.

Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.

OK mvs@


# 1.193 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.192 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, this limits throughput at very low on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as neccessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also incements the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But was is never called via this pointer. It would have
immediatley crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroy/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
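
The 75% rule from 1.131 can be sketched in a few lines of C. The
function names table_needs_resize() and table_next_size() are made up
for this example and do not appear in the kernel; the sketch only shows
the threshold test in integer arithmetic and the doubling step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True once the entry count has reached 75% of the table size. */
static bool
table_needs_resize(size_t count, size_t size)
{
	assert(size > 0);
	/* count >= 0.75 * size, written without floating point */
	return count * 4 >= size * 3;
}

/* Size to use after an insert: doubled at the threshold, else unchanged. */
static size_t
table_next_size(size_t count, size_t size)
{
	return table_needs_resize(count, size) ? size * 2 : size;
}
```

For a table of 128 buckets the threshold is 96 entries: 95 entries keep
the size at 128, 96 entries move it to 256.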


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@
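
A toy model of the relationship described in 1.126, in C. The mapping
table toy_l2[] and the function toy_rtable_l2() are invented for this
example; they only imitate what the log says rtable_l2(9) does and are
not its implementation. Tables 0 and 2 play the role of routing
domains, tables 1 and 3 are extra routing tables stacked on them.

```c
#include <assert.h>

#define TOY_NTABLES 4

/* Invented mapping: routing table i belongs to routing domain toy_l2[i]. */
static const unsigned int toy_l2[TOY_NTABLES] = { 0, 0, 2, 2 };

/* Toy counterpart of rtable_l2(9): routing table ID -> routing domain ID. */
static unsigned int
toy_rtable_l2(unsigned int rtableid)
{
	assert(rtableid < TOY_NTABLES);
	return toy_l2[rtableid];
}
```

A routing domain ID is itself a valid routing table ID and maps to
itself, which is why the assignment rtableid = rdomain is safe, while
the reverse direction needs the lookup.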


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
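
The two initial window sizes mentioned in 1.113 can be compared with a
small C sketch. The formulas are the ones given in RFC 3390 and in
RFC 6928 (the published form of draft-ietf-tcpm-initcwnd); the names
enum iw_mode and initial_window() are hypothetical and the kernel's own
macro may be written differently. The enum values mirror the sysctl
settings 1 and 2 named in the log.

```c
#include <assert.h>

/* Mirrors the net.inet.tcp.rfc3390 values named in the log message. */
enum iw_mode {
	IW_RFC3390 = 1,		/* old default, up to 4 * MSS */
	IW_10MSS = 2		/* new default, up to 10 * MSS */
};

static unsigned long
ul_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long
ul_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Initial congestion window in bytes for a given MSS. */
static unsigned long
initial_window(unsigned long mss, enum iw_mode mode)
{
	assert(mss > 0);
	if (mode == IW_10MSS)	/* RFC 6928: min(10*MSS, max(2*MSS, 14600)) */
		return ul_min(10 * mss, ul_max(2 * mss, 14600));
	/* RFC 3390: min(4*MSS, max(2*MSS, 4380)) */
	return ul_min(4 * mss, ul_max(2 * mss, 4380));
}
```

For an MSS of 1460 bytes this gives 14600 bytes in the new mode and
4380 bytes in the old one; for very large segments both formulas fall
back to 2 * MSS.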


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
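
The fixed-point layout named in 1.87 (5 bits of fraction for the
smoothed RTT, 4 bits for the variance) can be illustrated with a
textbook estimator in C. This is a sketch under assumptions, not
tcp_xmit_timer(): the struct, the function names and the macros are
hypothetical, and the gains 1/8 and 1/4 are the standard ones from
RFC 6298 rather than values taken from this log.

```c
#include <assert.h>

#define SRTT_FRAC_BITS   5	/* per the log: t_srtt has 5 fraction bits */
#define RTTVAR_FRAC_BITS 4	/* per the log: t_rttvar has 4 fraction bits */

/* Hypothetical estimator state, both fields in fixed point. */
struct rtt_est {
	int srtt;	/* smoothed RTT, ticks << SRTT_FRAC_BITS */
	int rttvar;	/* RTT variance, ticks << RTTVAR_FRAC_BITS */
};

/* Feed one RTT sample, given in whole ticks, into the estimator. */
static void
rtt_update(struct rtt_est *e, int sample_ticks)
{
	int sample, delta, dev;

	assert(sample_ticks > 0);
	sample = sample_ticks << SRTT_FRAC_BITS;
	if (e->srtt == 0) {
		/* first sample: srtt = R, rttvar = R / 2 */
		e->srtt = sample;
		e->rttvar = (sample_ticks << RTTVAR_FRAC_BITS) / 2;
		return;
	}
	delta = sample - e->srtt;	/* error, 5 fraction bits */
	e->srtt += delta / 8;		/* gain 1/8 */
	if (delta < 0)
		delta = -delta;
	/* missing shifts are the classic bug: rescale 5 -> 4 fraction bits */
	dev = delta >> (SRTT_FRAC_BITS - RTTVAR_FRAC_BITS);
	e->rttvar += (dev - e->rttvar) / 4;	/* gain 1/4 */
}

/* Smoothed RTT converted back to whole ticks. */
static int
rtt_srtt_ticks(const struct rtt_est *e)
{
	return e->srtt >> SRTT_FRAC_BITS;
}
```

Every place that mixes the two fields has to shift by the difference
in fraction bits, which is the kind of omission item 3 of the log
message refers to.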


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
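
The varargs problem from 1.65 is easy to reproduce outside the kernel.
The C sketch below uses a hypothetical function count_strings() that
walks a NULL-terminated argument list. If NULL is defined as a plain
integer 0, an uncast NULL is passed as an int, and reading it back as a
pointer with va_arg() is undefined where pointers are wider than int.
Casting the sentinel to a pointer type avoids that.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Count the strings in a list terminated by a null pointer. */
static int
count_strings(const char *first, ...)
{
	va_list ap;
	const char *s;
	int n = 0;

	va_start(ap, first);
	/* each va_arg() fetches a full pointer-sized argument */
	for (s = first; s != NULL; s = va_arg(ap, char *))
		n++;
	va_end(ap);
	return n;
}
```

A call would look like count_strings("a", "b", (void *)NULL); the cast
on the last argument is the point of the example.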


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seems to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
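
The overflow addressed in 1.24 can be shown with a short C sketch. The
names window_field(), TOY_MAXWIN and TOY_MAX_WINSHIFT are hypothetical;
the values 65535 and 14 are the 16-bit window limit and the maximum
shift count of the TCP window scale option. The sketch clamps the
window before scaling so the result always fits the 16-bit header
field.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAXWIN        65535UL	/* largest value of the 16-bit field */
#define TOY_MAX_WINSHIFT  14		/* largest window scale shift */

/* Value to place in the TCP header window field for a receive window. */
static uint16_t
window_field(unsigned long win, unsigned int rcv_scale)
{
	assert(rcv_scale <= TOY_MAX_WINSHIFT);
	/* clamp first, so the shifted value cannot exceed 16 bits */
	if (win > (TOY_MAXWIN << rcv_scale))
		win = TOY_MAXWIN << rcv_scale;
	return (uint16_t)(win >> rcv_scale);
}
```

Without the clamp, a 1 MB window with a shift of 4 would scale to 65536
and wrap to 0 when truncated to 16 bits.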


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.198 11-Feb-2024 bluhm

Remove include netinet6/ip6_var.h from netinet/in_pcb.h.

OK mvs@


# 1.197 28-Jan-2024 bluhm

Use more specific sockaddr type for inpcb notify.

in_pcbnotifyall() is an IPv4 only function. All callers check that
sockaddr dst is in fact a sockaddr_in. Pass the more specific type
and remove the runtime check at beginning of in_pcbnotifyall().
Use const sockaddr_in in in_pcbnotifyall() and const sockaddr_in6
in6_pcbnotify() as dst parameter.

OK millert@


# 1.196 27-Jan-2024 bluhm

Declare address parameter in TCP SYN cache const.

tcp6_ctlinput() cast a constant sockaddr_sin6 to non-const sockaddr.
sa6_src may be &sa6_any which lives in read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.

OK mvs@


# 1.195 11-Jan-2024 bluhm

Fix white spaces in TCP.


# 1.194 29-Nov-2023 bluhm

Document inp_socket as immutable and remove NULL checks.

Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.

OK mvs@


# 1.193 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.192 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
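
The figures in this message can be checked with a few lines of C. This is only an illustrative sketch; the helper names wrap_days(), wrap_years() and ts_val() are hypothetical and do not appear in the kernel.

```c
#include <stdint.h>

#define MS_PER_DAY	(1000.0 * 60 * 60 * 24)
#define DAYS_PER_YEAR	365.25

/* Days until a millisecond counter of the given width (< 64 bits) wraps. */
static double
wrap_days(unsigned int bits)
{
	return (double)(UINT64_C(1) << bits) / MS_PER_DAY;
}

/* The same span expressed in years. */
static double
wrap_years(unsigned int bits)
{
	return wrap_days(bits) / DAYS_PER_YEAR;
}

/* The 32-bit timestamp option carries the low 32 bits of the counter. */
static uint32_t
ts_val(uint64_t now)
{
	return (uint32_t)now;
}
```

wrap_days(32) comes out just under 50 days and wrap_years(63) at roughly 2.9*10^8 years, which matches the numbers quoted above.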


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, which limits throughput to a very low level on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
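
The ordering problem described here can be sketched as follows. This is a simplified illustration under stated assumptions, not the kernel code: struct conn, recompute_mss() and mss_shrank() are hypothetical stand-ins, and the 40 byte header overhead assumes IPv4 and TCP headers without options.

```c
/* Hypothetical stand-in for the relevant part of struct tcpcb. */
struct conn {
	unsigned int maxseg;	/* current maximum segment size */
};

/* Hypothetical stand-in for tcp_mss(): derive the MSS from the path MTU. */
static void
recompute_mss(struct conn *c, unsigned int path_mtu)
{
	c->maxseg = path_mtu - 40;	/* assumes IPv4 + TCP, no options */
}

/*
 * Save the old value first, let the MSS be recomputed, and only then
 * compare. Returns 1 if the segment size shrank, which is the case
 * where an immediate resend is more likely to fit.
 */
static int
mss_shrank(struct conn *c, unsigned int path_mtu)
{
	unsigned int orig_maxseg = c->maxseg;

	recompute_mss(c, path_mtu);
	return orig_maxseg > c->maxseg;
}
```

Comparing before recompute_mss() runs would always see the old value on both sides, which is the bug the message describes.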


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroy/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
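
The resize rule stated above can be written without floating point. The following is a minimal sketch; table_size_after_insert() is a hypothetical helper and not the function used in in_pcb.c.

```c
#include <stddef.h>

/*
 * Hypothetical helper: given the current table size and the number of
 * entries, return the size the table should have. The table doubles
 * once the entry count reaches 75% of its size.
 */
static size_t
table_size_after_insert(size_t size, size_t count)
{
	/* count >= 0.75 * size, expressed in integer arithmetic */
	if (count * 4 >= size * 3)
		return size * 2;
	return size;
}
```

With a table of 128 buckets, the 96th entry triggers a resize to 256 buckets.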


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it cast away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
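
For reference, the two initial window formulas can be sketched as below. The expressions follow RFC 3390 and draft-ietf-tcpm-initcwnd as published; the function names are hypothetical, and the kernel's own macro may be written differently.

```c
static unsigned int
umin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int
umax(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)) */
static unsigned int
iw_rfc3390(unsigned int mss)
{
	return umin(4 * mss, umax(2 * mss, 4380));
}

/* draft-ietf-tcpm-initcwnd: IW = min(10*MSS, max(2*MSS, 14600 bytes)) */
static unsigned int
iw_initcwnd10(unsigned int mss)
{
	return umin(10 * mss, umax(2 * mss, 14600));
}
```

With an MSS of 1460 the larger window is 14600 bytes, the figure quoted above. For small segments such as an MSS of 536, the 4*MSS and 10*MSS terms are the limiting ones.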


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs, now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
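
The varargs problem described in this entry can be illustrated with a small user-space sketch. This is not the kernel code the commit changed; count_ptrs() is a hypothetical helper that walks its arguments until it reads a null pointer.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: count pointer arguments up to a NULL sentinel.
 * Every argument is read back with va_arg(ap, void *), so the caller
 * must pass a real pointer-sized value as the terminator.
 */
static int
count_ptrs(const char *first, ...)
{
	va_list ap;
	const void *p = first;
	int n = 0;

	va_start(ap, first);
	while (p != NULL) {
		n++;
		p = va_arg(ap, void *);
	}
	va_end(ap);
	return n;
}
```

A call such as `count_ptrs("a", "b", (void *)NULL)` is safe everywhere. If NULL is defined as a plain integer 0, an uncast NULL is passed as an int, and on a platform with 64-bit pointers the callee may then read an indeterminate upper half, which is the failure the explicit cast avoids.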


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seems to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
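
As a rough illustration of the overflow this entry guards against: the TCP header carries the window in a 16-bit field, so a window that exceeds 65535 after right-shifting by the scale factor has to be clamped before it is stored. The sketch below is an assumption about the general technique, not the committed code; sketch_window_field() and SKETCH_TCP_MAXWIN are hypothetical names, and the scale is assumed to be at most 14.

```c
#include <stdint.h>

#define SKETCH_TCP_MAXWIN	65535u	/* largest value of the 16-bit field */

/*
 * Hypothetical helper: compute the value for the 16-bit window field
 * from a window of `win` bytes and a window scale shift `scale`
 * (assumed to be in the range 0..14).
 */
static uint16_t
sketch_window_field(uint32_t win, unsigned int scale)
{
	uint32_t max = (uint32_t)SKETCH_TCP_MAXWIN << scale;

	/* Clamp first, so the shifted result always fits in 16 bits. */
	if (win > max)
		win = max;
	return (uint16_t)(win >> scale);
}
```

Without the clamp, a value such as 70000 with scale 0 would be truncated to 4464 when stored in the header, advertising a far smaller window than intended.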


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.197 28-Jan-2024 bluhm

Use more specific sockaddr type for inpcb notify.

in_pcbnotifyall() is an IPv4 only function. All callers check that
sockaddr dst is in fact a sockaddr_in. Pass the more specific type
and remove the runtime check at beginning of in_pcbnotifyall().
Use const sockaddr_in in in_pcbnotifyall() and const sockaddr_in6
in6_pcbnotify() as dst parameter.

OK millert@


# 1.196 27-Jan-2024 bluhm

Declare address parameter in TCP SYN cache const.

tcp6_ctlinput() cast a constant sockaddr_sin6 to non-const sockaddr.
sa6_src may be &sa6_any which lives in read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.

OK mvs@


# 1.195 11-Jan-2024 bluhm

Fix white spaces in TCP.


# 1.194 29-Nov-2023 bluhm

Document inp_socket as immutable and remove NULL checks.

Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.

OK mvs@


# 1.193 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.192 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
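
The arithmetic in this entry can be checked with a short sketch. It is a user-space illustration only; the sketch_* helpers are hypothetical and do not correspond to functions in tcp_subr.c.

```c
#include <stdint.h>

#define SKETCH_MS_PER_DAY	(1000.0 * 60 * 60 * 24)

/* Days until a millisecond counter of the given bit width wraps. */
static double
sketch_wrap_days(unsigned int bits)
{
	double ms = 1.0;

	/* Build 2^bits in floating point so that bits == 63 is fine. */
	while (bits-- > 0)
		ms *= 2.0;
	return ms / SKETCH_MS_PER_DAY;
}

/* Lower 32 bits of a 64-bit counter, as carried in the timestamp option. */
static uint32_t
sketch_ts_value(uint64_t now)
{
	return (uint32_t)now;
}

/*
 * Elapsed milliseconds between two 32-bit timestamps. Unsigned
 * subtraction gives the right answer even across a wrap.
 */
static uint32_t
sketch_ts_elapsed(uint32_t later, uint32_t earlier)
{
	return later - earlier;
}
```

With these helpers, `sketch_wrap_days(32)` is about 49.7 days, and `sketch_wrap_days(63) / 365.25` is about 2.9 * 10^8 years, matching the figures quoted in the commit message.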


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers; in such a case the window size had been fixed at
16KB, which limits throughput to very low values on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
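
The change of comparison described in this entry can be sketched as a pair of predicates. This is only an illustration of the stated logic; both function names are hypothetical, and the real code operates on fields of the tcpcb rather than on plain integers.

```c
/*
 * Old condition as described: resend whenever the maximum segment
 * size differs after the path MTU update.
 */
static int
sketch_resend_if_changed(unsigned int old_mss, unsigned int new_mss)
{
	return old_mss != new_mss;
}

/*
 * New condition as described: resend only when the segment size
 * shrank, since only a smaller packet is more likely to fit.
 */
static int
sketch_resend_if_smaller(unsigned int old_mss, unsigned int new_mss)
{
	return old_mss > new_mss;
}
```

The two predicates differ only when the segment size grows: the old form would trigger an immediate resend in that case, the new form does not.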


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
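
The 75% rule described above can be sketched in plain C. This is only an
illustration under stated assumptions: struct pcb_table and both helper
names are hypothetical and are not taken from in_pcb.c, and the real code
also has to rehash the existing entries into the larger table.

```c
#include <stddef.h>

/* Hypothetical stand-in for the PCB hash table bookkeeping. */
struct pcb_table {
	size_t size;	/* number of hash buckets */
	size_t count;	/* number of entries currently hashed */
};

/* Return nonzero once the entry count reaches 75% of the table size. */
static int
pcb_table_should_grow(const struct pcb_table *t)
{
	/* count >= size * 3/4, written without division */
	return (t->count * 4 >= t->size * 3);
}

/* Double the table size when the threshold is reached. */
static size_t
pcb_table_next_size(const struct pcb_table *t)
{
	return (pcb_table_should_grow(t) ? t->size * 2 : t->size);
}
```

With 128 buckets the table would stay as it is at 95 entries and grow to
256 buckets once the 96th entry is inserted.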


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
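
A minimal sketch of the two initial window rules mentioned above, assuming
the usual formulas from RFC 3390 and draft-ietf-tcpm-initcwnd (segment
count capped by a byte bound). The function name initial_window() and the
mode parameter are hypothetical; mode only mirrors the sysctl values named
in the commit, and the actual macro in the tree may be written differently.

```c
#include <stdint.h>

static uint32_t
u32_min(uint32_t a, uint32_t b)
{
	return (a < b ? a : b);
}

static uint32_t
u32_max(uint32_t a, uint32_t b)
{
	return (a > b ? a : b);
}

/*
 * mode 2: min(10*MSS, max(2*MSS, 14600)), the initcwnd draft rule.
 * mode 1: min(4*MSS, max(2*MSS, 4380)), the RFC 3390 rule.
 * Both byte bounds are assumptions taken from those documents.
 */
static uint32_t
initial_window(int mode, uint32_t mss)
{
	if (mode == 2)
		return (u32_min(10 * mss, u32_max(2 * mss, 14600)));
	return (u32_min(4 * mss, u32_max(2 * mss, 4380)));
}
```

For an MSS of 1460 bytes the first rule yields 14600 bytes, which is
exactly 10 segments. For a small MSS such as 536 the segment count is the
limit, giving 5360 bytes.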


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers
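
The RFC 1948 idea is ISN = M + F(local address, local port, remote
address, remote port, secret), where M is a clock and F is a keyed hash.
The sketch below shows only that structure. All names are hypothetical,
and the FNV-1a mixer is a placeholder chosen to keep the example short; it
is NOT cryptographically secure and is not what the kernel uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection identifier (IPv4 only for brevity). */
struct conn_tuple {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* Placeholder mixer (FNV-1a); a real implementation needs a keyed hash. */
static uint32_t
mix_bytes(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return (h);
}

/* F(): per-connection offset derived from the secret and the 4-tuple. */
static uint32_t
isn_offset(const struct conn_tuple *c, const unsigned char *secret,
    size_t secretlen)
{
	uint32_t h = 2166136261u;

	h = mix_bytes(h, secret, secretlen);
	h = mix_bytes(h, &c->laddr, sizeof(c->laddr));
	h = mix_bytes(h, &c->faddr, sizeof(c->faddr));
	h = mix_bytes(h, &c->lport, sizeof(c->lport));
	h = mix_bytes(h, &c->fport, sizeof(c->fport));
	return (h);
}

/* ISN = M + F(); arithmetic wraps modulo 2^32 like TCP sequence space. */
static uint32_t
isn_generate(const struct conn_tuple *c, const unsigned char *secret,
    size_t secretlen, uint32_t clock)
{
	return (clock + isn_offset(c, secret, secretlen));
}
```

Because the offset is fixed for a given 4-tuple and secret, successive ISNs
for the same tuple advance exactly with the clock, which is the monotonic
property the commit message refers to.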


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
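
To illustrate items 2 to 4, here is a sketch of a smoothed RTT kept in
fixed point with 5 bits of fraction. It assumes the 5 bits are split into
a base shift of 2 (the value item 2 replaces) and a gain shift of 3, so
that each sample contributes 1/8 of its difference. The macro and function
names are hypothetical, and the rttvar update is omitted.

```c
#define RTT_BASE_SHIFT	2	/* assumed: sample is scaled by 4 */
#define RTT_GAIN_SHIFT	3	/* assumed: smoothing gain of 1/8 */

/* srtt is stored as ticks << (RTT_BASE_SHIFT + RTT_GAIN_SHIFT). */
static int
srtt_update(int srtt, int rtt_ticks)
{
	int delta;

	/* difference between the sample and the current estimate */
	delta = (rtt_ticks << RTT_BASE_SHIFT) - (srtt >> RTT_GAIN_SHIFT);
	srtt += delta;
	if (srtt <= 0)
		srtt = 1 << RTT_BASE_SHIFT;
	return (srtt);
}

/* Apply the "missing shift" to get back to whole ticks. */
static int
srtt_to_ticks(int srtt)
{
	return (srtt >> (RTT_BASE_SHIFT + RTT_GAIN_SHIFT));
}
```

Starting from an estimate of 10 ticks (stored as 320), a sample of 18
ticks moves the stored value to 352, which is 11 ticks: one eighth of the
8 tick difference.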


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
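
The varargs problem can be shown with a small standalone function. This is
a hedged sketch and not code from the tree; count_ptrs() is a hypothetical
helper. If NULL is defined as a plain integer 0, an uncast NULL passed
through "..." may be narrower than a pointer, so the terminator is cast.

```c
#include <stdarg.h>
#include <stddef.h>

/* Count pointer arguments up to a terminating null pointer. */
static int
count_ptrs(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	/* each va_arg() reads a full pointer-sized argument */
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return (n);
}
```

A call would look like count_ptrs("a", "b", (void *)NULL); the cast
guarantees that a pointer-sized value is pushed for the terminator.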


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoid traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.195 11-Jan-2024 bluhm

Fix white spaces in TCP.


# 1.194 29-Nov-2023 bluhm

Document inp_socket as immutable and remove NULL checks.

Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.

OK mvs@


# 1.193 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.192 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
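
The figures quoted above can be checked with a few lines of C. The helper
names are hypothetical and only serve the arithmetic; a year is taken as
365.25 days.

```c
#include <stdint.h>

#define MS_PER_DAY	(1000.0 * 60 * 60 * 24)
#define MS_PER_YEAR	(MS_PER_DAY * 365.25)

/* Days until a 32 bit millisecond counter wraps (2^32 ms). */
static double
wrap_days_32(void)
{
	return (4294967296.0 / MS_PER_DAY);
}

/* Years covered by 2^63 milliseconds. */
static double
span_years_63(void)
{
	return (9223372036854775808.0 / MS_PER_YEAR);
}

/* The timestamp option carries only the lower 32 bits of the counter. */
static uint32_t
ts_from_now(uint64_t now)
{
	return ((uint32_t)now);
}
```

wrap_days_32() evaluates to roughly 49.7 days and span_years_63() to
roughly 2.9*10^8 years, matching the numbers in the commit message.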


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, which limits throughput to a very low level on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It only makes sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
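
The comparison described in revision 1.177 can be illustrated with a
minimal sketch. This is not the kernel code: should_resend() and its
parameter names are hypothetical, and the sketch only assumes that the
segment size saved before the MSS recalculation is compared with the
value afterwards.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of tcp_subr.c.
 * old_maxseg: segment size saved before the MSS was recalculated.
 * new_maxseg: segment size after the recalculation.
 */
static bool
should_resend(unsigned int old_maxseg, unsigned int new_maxseg)
{
	/*
	 * "Greater than" instead of "not equal": resend only when the
	 * segment size shrank, since a smaller packet is more likely
	 * to fit the path MTU.
	 */
	return old_maxseg > new_maxseg;
}
```

With a "not equal" test, a grown segment size would also trigger an
immediate resend, which is the case the commit removes.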


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as a soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroy/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
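
The resize rule in revision 1.131 can be sketched as follows. This is
an illustration under stated assumptions, not the in_pcb code: the
function names are hypothetical, and "reaches 75%" is read here as
"count >= 3/4 of size".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true once the load factor reaches 75%. */
static bool
table_needs_resize(size_t count, size_t size)
{
	/* count / size >= 3 / 4, kept in integer arithmetic. */
	return count * 4 >= size * 3;
}

/* Hypothetical helper: the table size to use after an insertion. */
static size_t
table_next_size(size_t count, size_t size)
{
	return table_needs_resize(count, size) ? size * 2 : size;
}
```

Doubling keeps the size a power of two if it started as one, which
lets a hash be reduced with a mask instead of a modulo.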


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it cast away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
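
For reference, the two initial window sizes mentioned in revision 1.113
can be computed as below. This is a sketch, not the kernel code: the
function names are hypothetical, and the formulas are the ones given in
RFC 3390 and in the initcwnd proposal (later published as RFC 6928),
which the commit message summarizes as 4*MSS and 10*MSS.

```c
/* Hypothetical helpers for illustration only. */
static unsigned int
umin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int
umax(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)). */
static unsigned int
initial_window_rfc3390(unsigned int mss)
{
	return umin(4 * mss, umax(2 * mss, 4380));
}

/* initcwnd proposal: IW = min(10*MSS, max(2*MSS, 14600 bytes)). */
static unsigned int
initial_window_iw10(unsigned int mss)
{
	return umin(10 * mss, umax(2 * mss, 14600));
}
```

For an MSS of 1460 the first formula gives 4380 bytes and the second
14600 bytes, which matches the "10 * MSS or 14600 bytes" wording above.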


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in
distinct routing tables. The same network can now be used in any domain
at the same time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered, still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
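
The sequence number checks listed in revision 1.88 rely on modular
comparison of 32-bit values. The sketch below shows the idea only: the
lowercase function names are hypothetical stand-ins, not the kernel's
SEQ_MIN/SEQ_MAX macros, and ack_in_window() is an assumed reading of
"check th_ack against snd_una/max".

```c
#include <stdbool.h>
#include <stdint.h>

/* a is "before" b in sequence space, tolerating 32-bit wraparound. */
static bool
seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Wraparound-aware minimum and maximum of two sequence numbers. */
static uint32_t
seq_min(uint32_t a, uint32_t b)
{
	return seq_lt(a, b) ? a : b;
}

static uint32_t
seq_max(uint32_t a, uint32_t b)
{
	return seq_lt(a, b) ? b : a;
}

/* Assumed check: an ACK is plausible if snd_una <= ack <= snd_max. */
static bool
ack_in_window(uint32_t ack, uint32_t snd_una, uint32_t snd_max)
{
	return !seq_lt(ack, snd_una) && !seq_lt(snd_max, ack);
}
```

A plain "<" on the unsigned values would misorder numbers on either
side of the wrap at 2^32, which is why the signed difference is used.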


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seems to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.194 29-Nov-2023 bluhm

Document inp_socket as immutable and remove NULL checks.

Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.

OK mvs@


# 1.193 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.192 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
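
The arithmetic behind this change can be sketched as follows. The helper names are hypothetical and the code only illustrates the truncation and wrap-around behaviour described above; it is not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define MS_PER_DAY 86400000ull

/* Value placed in the 32 bit timestamp option: low half of the clock. */
uint32_t
ts_value(uint64_t now_ms)
{
	return (uint32_t)now_ms;
}

/* Elapsed milliseconds between two 32 bit timestamps, modulo 2^32. */
uint32_t
ts_elapsed(uint32_t later, uint32_t earlier)
{
	return (uint32_t)(later - earlier);
}

/* Whole days until a millisecond counter of the given width wraps. */
uint64_t
wrap_after_days(unsigned int bits)
{
	assert(bits <= 63);
	return (1ull << bits) / MS_PER_DAY;
}
```

Unsigned subtraction keeps short intervals correct even when the 32 bit value wraps between the two samples, which is why the casts mentioned in the commit can behave correctly.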


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
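
To make the "chopped to TCP maximum segment size" step concrete, here is a minimal sketch of the segment arithmetic only. The function names are hypothetical and no mbuf handling is shown.

```c
#include <assert.h>
#include <stdint.h>

/* Number of packets a large payload is chopped into for a given MSS. */
uint32_t
tso_segments(uint32_t paylen, uint32_t mss)
{
	assert(mss > 0);
	return paylen / mss + (paylen % mss != 0);
}

/* Payload length of the final packet; 0 when there is no payload. */
uint32_t
tso_last_len(uint32_t paylen, uint32_t mss)
{
	assert(mss > 0);
	if (paylen == 0)
		return 0;
	return (paylen % mss != 0) ? paylen % mss : mss;
}
```

With an MSS of 1460, a 14600 byte payload yields ten full packets, and one extra byte adds an eleventh packet carrying that single byte.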


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers; in such cases the window size had been fixed at
16KB, which limits throughput to a very low rate on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCBs need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested by Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
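
The changed comparison amounts to the following check, shown as a sketch with a hypothetical function name. The old MSS is the value saved before tcp_mss() runs and the new one is read afterwards.

```c
#include <assert.h>

/*
 * Hypothetical helper: resend at once only when the maximum segment
 * size shrank, since a smaller packet is more likely to fit the path.
 */
int
mss_shrank(unsigned int old_mss, unsigned int new_mss)
{
	assert(old_mss > 0 && new_mss > 0);
	return old_mss > new_mss;
}
```

A "not equal" test would also trigger when the MSS grows, where an immediate resend gains nothing.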


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
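
The growth rule stated above can be written down as a small sketch. The name `next_table_size` is hypothetical, and rehashing of the existing entries is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the table size to use once count entries
 * are stored. The size doubles when count reaches 75% of the current
 * size and stays unchanged otherwise.
 */
unsigned int
next_table_size(unsigned int count, unsigned int size)
{
	assert(size > 0);
	/* compare count / size >= 3 / 4 without division or overflow */
	if ((uint64_t)count * 4 >= (uint64_t)size * 3)
		return size * 2;
	return size;
}
```

For a 128 entry table the threshold is 96 entries, so the table is still 128 slots at 95 entries and becomes 256 slots at 96.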


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
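
For reference, the two initial window formulas as published in RFC 3390 and in the initcwnd proposal (later RFC 6928) are sketched below. The function names are hypothetical and this is not the kernel code; note that the RFC 3390 formula is bounded by 4380 bytes for a large MSS, so "4*MSS" in the message above is shorthand.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
u32_min(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static uint32_t
u32_max(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* RFC 3390: min(4*MSS, max(2*MSS, 4380 bytes)) */
uint32_t
iw_rfc3390(uint32_t mss)
{
	assert(mss > 0 && mss <= 65535);
	return u32_min(4 * mss, u32_max(2 * mss, 4380));
}

/* initcwnd proposal: min(10*MSS, max(2*MSS, 14600 bytes)) */
uint32_t
iw_ten_segments(uint32_t mss)
{
	assert(mss > 0 && mss <= 65535);
	return u32_min(10 * mss, u32_max(2 * mss, 14600));
}
```

With the common Ethernet MSS of 1460 the second formula gives exactly the 14600 bytes quoted in the commit message.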


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers
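
The RFC 1948 idea is ISN = M + F(connection id, secret), where M is a clock and F is a keyed hash, so each connection sees increasing numbers while different connections get unrelated offsets. The sketch below only shows that structure. It swaps in FNV-1a as a stand-in for the cryptographic hash, which is NOT secure, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IPv4 connection identifier. */
struct conn_id {
	uint32_t laddr, faddr;	/* local and foreign address */
	uint16_t lport, fport;	/* local and foreign port */
};

/* FNV-1a, used here only as a placeholder for a real keyed hash. */
static uint32_t
fnv1a(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len-- > 0) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/* ISN = clock + F(secret, connection id), all modulo 2^32. */
uint32_t
toy_isn(const struct conn_id *id, const unsigned char *secret,
    size_t secretlen, uint32_t clock)
{
	uint32_t h = 2166136261u;

	assert(id != NULL && secret != NULL);
	h = fnv1a(h, secret, secretlen);
	h = fnv1a(h, &id->laddr, sizeof(id->laddr));
	h = fnv1a(h, &id->faddr, sizeof(id->faddr));
	h = fnv1a(h, &id->lport, sizeof(id->lport));
	h = fnv1a(h, &id->fport, sizeof(id->fport));
	return clock + h;
}
```

For a fixed connection id and secret, the difference between two results equals the difference between the two clock values, which is the monotonic property mentioned for the client side.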


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
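
The fraction widths named in item 4 can be illustrated with a generic fixed-point RTT smoother. This is a sketch of the textbook update (gain 1/8 for the mean, 1/4 for the deviation) using 5 and 4 fraction bits; it is not tcp_xmit_timer(), and the struct, macro, and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SRTT_FRAC_BITS   5	/* srtt is stored multiplied by 32 */
#define RTTVAR_FRAC_BITS 4	/* rttvar is stored multiplied by 16 */

struct rtt_state {
	int32_t srtt;	/* smoothed RTT, 5 fraction bits; 0 = no sample yet */
	int32_t rttvar;	/* RTT deviation, 4 fraction bits */
};

/* Feed one RTT measurement, in ticks, into the estimator. */
void
rtt_update(struct rtt_state *s, int32_t rtt)
{
	int32_t err;

	assert(s != NULL && rtt >= 0 && rtt < (1 << 20));
	if (s->srtt == 0) {
		/* first sample: srtt = rtt, rttvar = rtt / 2 */
		s->srtt = rtt << SRTT_FRAC_BITS;
		s->rttvar = (rtt << RTTVAR_FRAC_BITS) / 2;
		return;
	}
	err = (rtt << SRTT_FRAC_BITS) - s->srtt;	/* 5 fraction bits */
	s->srtt += err / 8;				/* gain 1/8 */
	if (err < 0)
		err = -err;
	/* drop one fraction bit to match rttvar, then apply gain 1/4 */
	s->rttvar += ((err >> 1) - s->rttvar) / 4;
}

/* Smoothed RTT in whole ticks: the "missing shift" when reading srtt. */
int32_t
rtt_smoothed(const struct rtt_state *s)
{
	assert(s != NULL);
	return s->srtt >> SRTT_FRAC_BITS;
}
```

Reading `srtt` without the final shift would yield a value 32 times too large, which is the class of bug item 3 refers to.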


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
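
The varargs problem described above can be illustrated outside the kernel. A minimal sketch, assuming a hypothetical variadic function count_strs() that reads pointers until it sees a null sentinel; it is not code from the tree. If NULL expands to a plain integer 0, the callee may fetch a full pointer-sized slot of which only part was written, so the sentinel is cast explicitly:

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical helper: count string arguments up to a null-pointer sentinel. */
static int
count_strs(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	assert(first != NULL);
	va_start(ap, first);
	/* each va_arg() fetches a full pointer, so the caller must pass one */
	for (p = first; p != NULL; p = va_arg(ap, char *))
		n++;
	va_end(ap);
	return n;
}
```

A call would look like count_strs("a", "b", (void *)NULL); the cast guarantees that a pointer-sized null is passed regardless of how NULL is defined.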


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.
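
Arithmetic on void pointers is a GNU extension rather than standard C, which is what this and the following revision work around. A minimal sketch of the portable form, assuming a hypothetical helper ptr_advance() that is not part of the tree:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: advance a generic pointer by off bytes. */
static void *
ptr_advance(void *base, size_t off)
{
	assert(base != NULL);
	/* cast to char * first: pointer arithmetic needs a complete object type */
	return (char *)base + off;
}
```

Casting to char * makes the byte-wise step explicit and keeps the code valid for compilers that reject void pointer arithmetic.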


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seems to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
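
The overflow guarded against here comes from the window field being 16 bits wide on the wire, while the window tracked internally can be larger once scaling is in use. A minimal sketch of the clamp, assuming hypothetical names MAX_WIN, MAX_WINSHIFT and adv_window(); it shows the idea rather than the actual change:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WIN		65535u	/* largest value of the 16-bit window field */
#define MAX_WINSHIFT	14	/* largest window scale shift (RFC 1323) */

/* Hypothetical helper: value to place in the header for a window of win bytes. */
static uint16_t
adv_window(uint32_t win, unsigned rcv_scale)
{
	uint32_t limit;

	assert(rcv_scale <= MAX_WINSHIFT);
	/* clamp first, so the shifted result always fits in 16 bits */
	limit = (uint32_t)MAX_WIN << rcv_scale;
	if (win > limit)
		win = limit;
	return (uint16_t)(win >> rcv_scale);
}
```

Without the clamp, a window larger than 65535 << rcv_scale would be truncated when stored in the 16-bit field and the peer would see a much smaller window.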


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.193 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.192 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
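
The figures quoted in this commit can be checked with a little arithmetic. A minimal sketch, assuming hypothetical helpers wrap_days(), ts_val() and ts_elapsed() that are not taken from the tree; it computes how long a millisecond counter of a given width runs before wrapping and shows how the low 32 bits serve the timestamp option:

```c
#include <assert.h>
#include <stdint.h>

/* Days until a millisecond counter with the given number of bits wraps. */
static double
wrap_days(unsigned bits)
{
	double ms = 1.0;
	unsigned i;

	assert(bits > 0 && bits <= 64);
	for (i = 0; i < bits; i++)	/* 2^bits, in floating point */
		ms *= 2.0;
	return ms / (1000.0 * 60 * 60 * 24);
}

/* Timestamp option value: the low 32 bits of the 64-bit counter. */
static uint32_t
ts_val(uint64_t now)
{
	return (uint32_t)now;
}

/* Elapsed milliseconds since an echoed timestamp, modulo 2^32. */
static uint32_t
ts_elapsed(uint64_t now, uint32_t echoed)
{
	return ts_val(now) - echoed;
}
```

wrap_days(32) comes out a little under 50 days, and wrap_days(63) divided by 365.25 is roughly 2.9*10^8 years, matching the numbers above. The unsigned subtraction in ts_elapsed() stays correct across a wrap of the low 32 bits.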


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, this limits throughput at very low on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
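
The growth rule described in this commit can be written as a short check. A minimal sketch, assuming hypothetical helpers needs_resize() and next_size() that are not taken from the tree; it shows only the 75% threshold and the doubling, not the rehashing of existing entries:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: has the table reached 75% of its size? */
static int
needs_resize(size_t count, size_t size)
{
	assert(size > 0);
	/* count >= 0.75 * size, kept in integer arithmetic */
	return count * 4 >= size * 3;
}

/* Hypothetical helper: table size to use after inserting up to count entries. */
static size_t
next_size(size_t count, size_t size)
{
	return needs_resize(count, size) ? size * 2 : size;
}
```

With 128 buckets, for example, the table stays as it is at 95 entries and doubles to 256 once the 96th entry is present.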


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
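
The two initial window sizes mentioned above can be compared with a small sketch. It assumes the upper-bound formulas as given in RFC 3390 and in draft-ietf-tcpm-initcwnd; it is not the kernel's own macro, and the function names are made up for illustration.

```c
#include <stdint.h>

static uint32_t
min_u32(uint32_t a, uint32_t b)
{
	return (a < b ? a : b);
}

static uint32_t
max_u32(uint32_t a, uint32_t b)
{
	return (a > b ? a : b);
}

/* RFC 3390 upper bound: min(4*MSS, max(2*MSS, 4380 bytes)). */
static uint32_t
iw_rfc3390(uint32_t mss)
{
	return (min_u32(4 * mss, max_u32(2 * mss, 4380)));
}

/* draft-ietf-tcpm-initcwnd bound: min(10*MSS, max(2*MSS, 14600 bytes)). */
static uint32_t
iw_initcwnd(uint32_t mss)
{
	return (min_u32(10 * mss, max_u32(2 * mss, 14600)));
}
```

For a typical Ethernet MSS of 1460 bytes the larger window is exactly 10 * 1460 = 14600 bytes, which is where the figure in the commit message comes from.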


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
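
To illustrate the varargs problem described above: where NULL is defined as a plain 0, an unprototyped variadic argument is passed as an int, so a callee reading a pointer may see garbage in the upper bits on a 64 bit system. The sketch below uses a hypothetical helper, count_ptr_args(), and shows the explicit cast at the call site.

```c
#include <stdarg.h>
#include <stddef.h>

/* Count pointer arguments up to (not including) a null-pointer sentinel. */
static int
count_ptr_args(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	/* Each variadic argument is read back as a full-width pointer. */
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return (n);
}

/* The sentinel is cast so that a pointer-sized value is really passed. */
static int
count_two(void)
{
	return (count_ptr_args("a", "b", (void *)NULL));
}
```

The first parameter is covered by the prototype and converts correctly on its own; only the arguments in the variadic part need the cast.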


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.192 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
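
The figures in this commit can be checked with a short sketch. The helper names are hypothetical and the year length of 365.25 days is an assumption; the code only reproduces the arithmetic and shows how the lower 32 bits of a 64 bit counter still yield correct differences across a wrap.

```c
#include <stdint.h>

#define MS_PER_DAY	86400000.0
#define MS_PER_YEAR	(MS_PER_DAY * 365.25)	/* assumed year length */

/* Days until a 32 bit millisecond counter wraps around (2^32 ms). */
static double
wrap_days_32(void)
{
	return (4294967296.0 / MS_PER_DAY);
}

/* Years covered by 2^63 milliseconds. */
static double
wrap_years_63(void)
{
	return (9223372036854775808.0 / MS_PER_YEAR);
}

/* The TCP timestamp option carries only the lower 32 bits. */
static uint32_t
ts_val(uint64_t now)
{
	return ((uint32_t)now);
}

/* Elapsed milliseconds between two 32 bit timestamps, modulo 2^32. */
static uint32_t
ts_elapsed(uint32_t later, uint32_t earlier)
{
	return (later - earlier);
}
```

Because the subtraction is done modulo 2^32, two truncated timestamps taken shortly before and after a wrap of the lower 32 bits still give the right interval.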


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, this limits throughput to a very low level on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timer.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroy/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
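
The resize rule described in this commit can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the in_pcb.c code: the names `pcb_table`, `pcb_table_should_grow` and `pcb_table_insert` are hypothetical, and the rehashing of existing entries into the larger table is left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for a PCB hash table; not the kernel's own struct. */
struct pcb_table {
	size_t size;	/* number of hash buckets */
	size_t count;	/* number of entries currently stored */
};

/* Return nonzero once the entry count has reached 75% of the table size. */
static int
pcb_table_should_grow(const struct pcb_table *t)
{
	/* count >= 0.75 * size, written with integer arithmetic only */
	return (t->count * 4 >= t->size * 3);
}

/* Account for one new entry and double the bucket count at the threshold. */
static void
pcb_table_insert(struct pcb_table *t)
{
	t->count++;
	if (pcb_table_should_grow(t))
		t->size *= 2;	/* a real table would also rehash its entries */
}
```

Doubling at a fixed load factor keeps the average chain length bounded while the amortized cost of resizing stays constant per insertion.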


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
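
For reference, the two initial window sizes mentioned in this commit can be sketched as below. The formulas are the ones given in RFC 3390 and in the initcwnd draft (later published as RFC 6928); the commit's "4*MSS" and "10*MSS" are shorthand for them. The function names are hypothetical and this is not the kernel's own macro or code.

```c
#include <stdint.h>

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)). */
static uint32_t
tcp_iw_rfc3390(uint32_t mss)
{
	uint32_t floor = (2 * mss > 4380) ? 2 * mss : 4380;

	return (4 * mss < floor) ? 4 * mss : floor;
}

/* draft-ietf-tcpm-initcwnd: IW = min(10*MSS, max(2*MSS, 14600 bytes)). */
static uint32_t
tcp_iw_initcwnd10(uint32_t mss)
{
	uint32_t floor = (2 * mss > 14600) ? 2 * mss : 14600;

	return (10 * mss < floor) ? 10 * mss : floor;
}
```

With an MSS of 1460 bytes the second formula yields exactly 14600 bytes, the figure quoted in the commit message. Per that message, net.inet.tcp.rfc3390=2 selects the larger window and 1 selects the older one.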


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
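
The kind of overflow this commit guards against can be sketched as follows. The TCP header carries the window in a 16-bit field holding the real window shifted right by the window scale, so the real window has to be capped at 65535 << scale before shifting. This is a minimal sketch, not the tcp_subr.c code: `MAX_WIN` and `tcp_window_field` are hypothetical names, and the scale is assumed to be at most 14, the maximum the window scale option allows.

```c
#include <stdint.h>

#define MAX_WIN 65535L	/* largest value a 16-bit TCP window field can hold */

/*
 * Clamp the window so that (win >> scale) fits in 16 bits and
 * return the value that would go into the TCP header.
 * Assumes scale <= 14.
 */
static uint16_t
tcp_window_field(long win, unsigned int scale)
{
	if (win < 0)
		win = 0;
	if (win > (MAX_WIN << scale))
		win = MAX_WIN << scale;
	return (uint16_t)(win >> scale);
}
```

Without the clamp, a window of 70000 bytes at scale 0 would be truncated to 4464 when stored in 16 bits; with it, the field saturates at 65535.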


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.191 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to work around possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers, the window size had been fixed at 16KB in such
cases, which severely limits throughput on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
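
The ordering fix described in this entry can be sketched in userland C.
This is only an illustration under stated assumptions, not the kernel
code: struct toy_tcpcb, toy_update_mss() and toy_mtudisc() are
hypothetical names, and toy_update_mss() merely stands in for tcp_mss().

```c
/* Toy stand-in for struct tcpcb; only the field the sketch needs. */
struct toy_tcpcb {
	unsigned int t_maxseg;	/* current maximum segment size */
};

/* Hypothetical stand-in for tcp_mss(): derive the MSS from the path MTU. */
static void
toy_update_mss(struct toy_tcpcb *tp, unsigned int path_mtu,
    unsigned int hdrlen)
{
	tp->t_maxseg = path_mtu > hdrlen ? path_mtu - hdrlen : 0;
}

/* Returns 1 if the caller should resend immediately, 0 otherwise. */
static int
toy_mtudisc(struct toy_tcpcb *tp, unsigned int path_mtu, unsigned int hdrlen)
{
	unsigned int orig_maxseg = tp->t_maxseg;  /* save before the update */

	toy_update_mss(tp, path_mtu, hdrlen);

	/* Compare after the update, and only resend if the MSS shrank. */
	return orig_maxseg > tp->t_maxseg;
}
```

With a "not equal" test a growing MSS would also trigger a resend; the
"greater than" test limits it to the case where a smaller segment is
more likely to fit.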


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timer.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
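
The growth rule in this entry can be sketched as a small helper. This
is a hedged illustration only: toy_next_table_size() is a hypothetical
name, and whether the kernel compares with ">=" or ">" at the 75% mark
is an assumption here.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the table size to use once nentries
 * entries are stored in a table of the given size.  The size doubles
 * when the load reaches 75%; the integer comparison avoids rounding.
 */
static size_t
toy_next_table_size(size_t nentries, size_t size)
{
	if (nentries * 4 >= size * 3)
		return size * 2;
	return size;
}
```

For a table of 128 buckets this keeps the size up to 95 entries and
doubles it to 256 at the 96th.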


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
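
The window sizes named in this entry can be worked through with a
short sketch. It assumes the formulas from RFC 3390,
min(4*MSS, max(2*MSS, 4380)), and from the initcwnd draft,
min(10*MSS, max(2*MSS, 14600)); toy_initial_window() is a hypothetical
name, and the single-segment fallback for other sysctl values is an
assumption for illustration, not a statement about the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper: initial congestion window in bytes for a given
 * MSS and a value modelled on net.inet.tcp.rfc3390.
 */
static uint32_t
toy_initial_window(uint32_t mss, int rfc3390)
{
	uint32_t segs, cap, win;

	if (rfc3390 == 2) {		/* 10 * MSS, capped near 14600 bytes */
		segs = 10;
		cap = 14600;
	} else if (rfc3390 == 1) {	/* 4 * MSS, capped near 4380 bytes */
		segs = 4;
		cap = 4380;
	} else {
		return mss;		/* assumption: one segment otherwise */
	}

	/* min(segs * mss, max(2 * mss, cap)) */
	win = 2 * mss > cap ? 2 * mss : cap;
	if (segs * mss < win)
		win = segs * mss;
	return win;
}
```

With a typical Ethernet MSS of 1460 bytes the two settings give 14600
and 4380 bytes; with a small MSS such as 536 the segment count is the
limit (5360 bytes for the 10*MSS setting).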


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers
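
The RFC 1948 scheme mentioned here computes ISN = M + F(local address,
local port, remote address, remote port, secret), where M is a clock
value. A minimal sketch follows, with loud assumptions: all names are
hypothetical, and toy_hash() is FNV-1a, which is NOT cryptographic and
only stands in for the keyed hash a real implementation would use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection identifier (IPv4 only, for brevity). */
struct toy_conn_id {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* FNV-1a step over a buffer; a non-cryptographic placeholder for F. */
static uint32_t
toy_hash(const void *buf, size_t len, uint32_t h)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Per-connection offset: F(laddr, lport, faddr, fport, secret). */
static uint32_t
toy_isn_offset(const struct toy_conn_id *id, uint32_t secret)
{
	uint32_t h = 2166136261u;

	h = toy_hash(&id->laddr, sizeof(id->laddr), h);
	h = toy_hash(&id->lport, sizeof(id->lport), h);
	h = toy_hash(&id->faddr, sizeof(id->faddr), h);
	h = toy_hash(&id->fport, sizeof(id->fport), h);
	h = toy_hash(&secret, sizeof(secret), h);
	return h;
}

/* ISN = M + F(...); unsigned arithmetic wraps modulo 2^32. */
static uint32_t
toy_isn(const struct toy_conn_id *id, uint32_t secret, uint32_t clock)
{
	return clock + toy_isn_offset(id, secret);
}
```

Because the offset is fixed per connection tuple, sequence numbers for
one tuple advance monotonically with the clock, while an observer
without the secret cannot relate them to those of another tuple.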


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Soultion is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
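
A minimal sketch of the kind of clamp revision 1.24 refers to, assuming the usual TCP rule that the window field is 16 bits wide and the window scale shift is at most 14. The function window_field() and both macros are hypothetical names used for illustration, not the code from this revision.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_UNSCALED_WIN	65535u	/* largest value of the 16-bit field */
#define MAX_WINSHIFT		14	/* largest permitted scale shift */

/*
 * Return the value to place in the 16-bit window field for a receive
 * window of "win" bytes and a window scale shift of "shift".
 * The window is clamped first so the shifted result cannot overflow.
 */
uint16_t
window_field(uint32_t win, unsigned int shift)
{
	uint32_t max;

	assert(shift <= MAX_WINSHIFT);
	max = (uint32_t)MAX_UNSCALED_WIN << shift;
	if (win > max)
		win = max;
	return (uint16_t)(win >> shift);
}
```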


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.190 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, which limits throughput to very low rates on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCBs need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
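
The resize rule stated in revision 1.131 can be sketched as follows. This is an illustration under stated assumptions, not the in_pcb code: struct table_stats and both functions are hypothetical, and rehashing of existing entries is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a hash table. */
struct table_stats {
	size_t size;	/* number of buckets */
	size_t count;	/* number of stored entries */
};

/* Return nonzero once the entry count reaches 75% of the bucket count. */
int
table_needs_resize(const struct table_stats *t)
{
	return t->count * 4 >= t->size * 3;
}

/*
 * Record one insertion and double the bucket count when the 75%
 * threshold is reached. Returns the resulting bucket count.
 * Moving the existing entries to the new buckets is omitted here.
 */
size_t
table_note_insert(struct table_stats *t)
{
	assert(t->size > 0);
	t->count++;
	if (table_needs_resize(t))
		t->size *= 2;
	return t->size;
}
```

With 128 buckets, the table in this sketch grows to 256 buckets on the 96th insertion.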


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
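
For reference, the two initial window rules mentioned in revision 1.113 can be written out as below. The formulas are the ones published in RFC 3390 and in draft-ietf-tcpm-initcwnd (later RFC 6928); the function names are hypothetical and the kernel's own macros may be written differently.

```c
#include <assert.h>

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)). */
unsigned int
iw_rfc3390(unsigned int mss)
{
	unsigned int floor_bytes;

	assert(mss > 0);
	floor_bytes = 2 * mss > 4380 ? 2 * mss : 4380;
	return 4 * mss < floor_bytes ? 4 * mss : floor_bytes;
}

/* draft-ietf-tcpm-initcwnd: IW = min(10*MSS, max(2*MSS, 14600 bytes)). */
unsigned int
iw_initcwnd10(unsigned int mss)
{
	unsigned int floor_bytes;

	assert(mss > 0);
	floor_bytes = 2 * mss > 14600 ? 2 * mss : 14600;
	return 10 * mss < floor_bytes ? 10 * mss : floor_bytes;
}
```

For an MSS of 1460 bytes these give 4380 and 14600 bytes respectively, which is where the 14600 figure in the commit message comes from.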


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
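
The varargs problem described above can be sketched with a small variadic function that reads each argument as a full-width pointer. The helper count_ptrs() is hypothetical and is not taken from the kernel; it only illustrates why the sentinel needs the (void *) cast. If NULL is defined as a plain integer 0, an uncast NULL is passed through "..." as an int and the callee may read garbage in the upper half of a 64-bit pointer.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: count pointer arguments up to a null-pointer
 * sentinel. Every variadic argument is fetched as a void *, so the
 * caller must pass the sentinel as a pointer, not as an int.
 */
static int
count_ptrs(const char *first, ...)
{
	va_list ap;
	const void *p;
	int n = 0;

	va_start(ap, first);
	for (p = first; p != NULL; p = va_arg(ap, void *))
		n++;
	va_end(ap);
	return n;
}
```

A call such as count_ptrs("a", "b", (void *)NULL) terminates reliably on both 32-bit and 64-bit systems because the sentinel has pointer width.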


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
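
A minimal sketch of the kind of clamp this entry describes, assuming the goal is that the scaled window always fits the 16-bit header field. The function sk_window_field() and the constant SK_TCP_MAXWIN are hypothetical names for illustration; the actual change in the kernel may be structured differently.

```c
#include <stdint.h>

/* Largest value the 16-bit TCP window field can carry. */
#define SK_TCP_MAXWIN	65535u

/*
 * Hypothetical helper: clamp the receive window so that, after the
 * window scale shift, the result fits in 16 bits instead of wrapping.
 * The scale is assumed to be at most 14, the limit for TCP window scaling.
 */
static uint16_t
sk_window_field(uint32_t win, unsigned int scale)
{
	uint32_t limit = (uint32_t)SK_TCP_MAXWIN << scale;

	if (win > limit)
		win = limit;
	return (uint16_t)(win >> scale);
}
```

Without the clamp, a window of 70000 at scale 0 would be truncated to 4464 when stored in the 16-bit field; with it, the field carries 65535.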


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.189 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCBs need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested by Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rdnvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
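
The resize rule described above can be sketched as follows. This is a minimal illustration and not the kernel's code: the function name `pcbhash_next_size` and its parameters are hypothetical, and the 75% threshold is written in integer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the OpenBSD tree): return the table
 * size to use when `count` entries live in a table of `size` buckets.
 * The size doubles once the entry count reaches 75% of the table size.
 */
size_t
pcbhash_next_size(size_t count, size_t size)
{
	assert(size > 0);

	/* count >= 0.75 * size, without floating point */
	if (count * 4 >= size * 3)
		return size * 2;
	return size;
}
```

With a 128-bucket table, the 96th entry is the first one that triggers doubling to 256 buckets.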


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
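
The two initial window sizes mentioned above can be sketched as follows. This is a minimal illustration that assumes the kernel follows the published formulas: RFC 3390 gives min(4*MSS, max(2*MSS, 4380)) and draft-ietf-tcpm-initcwnd gives min(10*MSS, max(2*MSS, 14600)). The function names and the `rfc3390` selector argument are hypothetical and only model the sysctl values 1 and 2 named in the commit message.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static uint32_t
max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)) */
uint32_t
iw_rfc3390(uint32_t mss)
{
	return min_u32(4 * mss, max_u32(2 * mss, 4380));
}

/* draft-ietf-tcpm-initcwnd: IW = min(10*MSS, max(2*MSS, 14600 bytes)) */
uint32_t
iw_initcwnd10(uint32_t mss)
{
	return min_u32(10 * mss, max_u32(2 * mss, 14600));
}

/*
 * Hypothetical selector modeled on net.inet.tcp.rfc3390:
 * 1 selects the older 4*MSS rule, 2 selects the 10*MSS rule.
 * Other values are not covered by this sketch.
 */
uint32_t
initial_window(int rfc3390, uint32_t mss)
{
	assert(rfc3390 == 1 || rfc3390 == 2);
	return rfc3390 == 2 ? iw_initcwnd10(mss) : iw_rfc3390(mss);
}
```

For a typical Ethernet MSS of 1460 bytes, the two rules give 4380 and 14600 bytes respectively, which matches the "10 * MSS or 14600 bytes" wording in the commit message.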


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to an
alternate routing table and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers
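
The RFC 1948 idea behind this change can be sketched as follows: the ISN is a clock value plus a keyed hash of the connection 4-tuple, so each connection sees a monotonic sequence space while the offset stays unpredictable to outsiders. This is a minimal illustration and not the kernel's code. The function names are hypothetical, and FNV-1a is used only as a stand-in; it is not cryptographically secure, and a real implementation needs a cryptographic hash with a secret key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a update step: a NON-cryptographic stand-in for the keyed hash. */
static uint32_t
fnv1a(uint32_t h, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len-- > 0) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/*
 * Hypothetical sketch of RFC 1948:
 * ISN = M + F(localhost, localport, remotehost, remoteport, secret),
 * where M is a clock and F is a hash over the 4-tuple and a secret.
 */
uint32_t
isn_sketch(uint32_t clock, uint32_t laddr, uint16_t lport,
    uint32_t faddr, uint16_t fport, const uint8_t secret[16])
{
	uint32_t h = 2166136261u;	/* FNV offset basis */

	assert(secret != NULL);

	/* Hash each field separately to avoid struct padding bytes. */
	h = fnv1a(h, secret, 16);
	h = fnv1a(h, &laddr, sizeof(laddr));
	h = fnv1a(h, &lport, sizeof(lport));
	h = fnv1a(h, &faddr, sizeof(faddr));
	h = fnv1a(h, &fport, sizeof(fport));

	/* Unsigned addition wraps modulo 2^32, as sequence numbers do. */
	return clock + h;
}
```

For a fixed 4-tuple and secret, two calls differ exactly by the difference of their clock values, which is the monotonic property the commit message refers to.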


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
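
The overflow concern above can be sketched as follows: the TCP header window field is 16 bits wide, so the receive window must be clamped after it is shifted right by the negotiated scale. This is a minimal illustration and not the kernel's code. The function name is hypothetical, and the constants `WIN_MAX` and `WINSHIFT_MAX` are local names chosen for the sketch that mirror the usual protocol limits of 65535 and 14.

```c
#include <assert.h>
#include <stdint.h>

#define WIN_MAX		65535u	/* largest value of the 16-bit window field */
#define WINSHIFT_MAX	14u	/* largest window scale shift */

/*
 * Hypothetical helper: compute the value to place in the 16-bit TCP
 * window field for a receive window of `win` bytes and a receive
 * scale of `rcv_scale`, clamping instead of letting it wrap.
 */
uint16_t
advertised_window(uint32_t win, unsigned int rcv_scale)
{
	uint32_t scaled;

	assert(rcv_scale <= WINSHIFT_MAX);

	scaled = win >> rcv_scale;
	if (scaled > WIN_MAX)
		scaled = WIN_MAX;
	return (uint16_t)scaled;
}
```

Without the clamp, a window of 1048576 bytes with a shift of 4 would scale to 65536 and truncate to 0 in a 16-bit field.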


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.188 03-Sep-2022 bluhm

Initialize TCP mutex forgotten in previous commit.
found by Hrvoje Popovski with witness; OK mvs@


# 1.187 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
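
The resize rule described in 1.131 can be sketched roughly as below.
This is an illustration only, not the kernel's code: the names
table_needs_resize() and table_next_size() are hypothetical, and the
check assumes counts small enough that count * 4 does not overflow.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the OpenBSD tree): report whether a
 * hash table holding `count` entries in `size` buckets has reached
 * the 75% load threshold mentioned in the commit message.
 */
static int
table_needs_resize(size_t count, size_t size)
{
	assert(size > 0);
	/* count / size >= 3/4, written without division */
	return count * 4 >= size * 3;
}

/* Hypothetical helper: the table size after applying the rule once. */
static size_t
table_next_size(size_t count, size_t size)
{
	return table_needs_resize(count, size) ? size * 2 : size;
}
```

With 128 buckets the threshold is reached at 96 entries, at which
point the sketch would grow the table to 256 buckets.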


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
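
For reference, the two initial window sizes mentioned in 1.113 can be
sketched as below. The formulas are the ones given in RFC 3390 and in
draft-ietf-tcpm-initcwnd (later RFC 6928); whether the kernel macro is
written exactly this way is an assumption, and initial_window() and its
mode argument are hypothetical names that mirror the sysctl values 1
and 2 from the commit message.

```c
#include <assert.h>

static unsigned long
sketch_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long
sketch_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical helper: initial congestion window in bytes.
 *   mode 2: min(10 * MSS, max(2 * MSS, 14600))  (draft-ietf-tcpm-initcwnd)
 *   mode 1: min(4 * MSS, max(2 * MSS, 4380))    (RFC 3390)
 */
static unsigned long
initial_window(unsigned long mss, int mode)
{
	assert(mode == 1 || mode == 2);
	if (mode == 2)
		return sketch_min(10 * mss, sketch_max(2 * mss, 14600));
	return sketch_min(4 * mss, sketch_max(2 * mss, 4380));
}
```

With an MSS of 1460 bytes the first formula yields 14600 bytes, which
matches the "10 * MSS or 14600 bytes" wording of the commit.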


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
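
The kind of check described in 1.24 can be sketched as below. This is
an assumption about the general technique, not the committed diff:
advertised_window() and the SKETCH_* constants are hypothetical names.
The value placed in the TCP header is the window shifted right by the
scale factor, clamped so it fits the 16-bit field.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAXWIN		65535UL	/* largest value of the 16-bit window field */
#define SKETCH_MAX_WINSHIFT	14	/* largest shift the window scale option allows */

/*
 * Hypothetical helper: compute the 16-bit window field for a receive
 * window of `win` bytes and a negotiated shift of `scale`.
 */
static uint16_t
advertised_window(unsigned long win, unsigned int scale)
{
	unsigned long field;

	assert(scale <= SKETCH_MAX_WINSHIFT);
	field = win >> scale;
	if (field > SKETCH_MAXWIN)
		field = SKETCH_MAXWIN;	/* clamp instead of wrapping around */
	return (uint16_t)field;
}
```

Without the clamp, a 70000 byte window with no scaling would wrap to
4464 when truncated to 16 bits.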


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.186 30-Aug-2022 bluhm

Refactor internet PCB lookup function. Rename in_pcbhashlookup()
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCBs need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
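The ordering fix described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the kernel code: `struct conn`, `recompute_mss()` and `mtudisc_should_resend()` are hypothetical names, and the MSS recomputation is reduced to subtracting an assumed 40 bytes of IPv4 and TCP headers from the path MTU.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the per-connection state. */
struct conn {
	unsigned int maxseg;	/* current maximum segment size */
};

/* Assumed header overhead: 20 bytes IPv4 + 20 bytes TCP, no options. */
#define HDR_OVERHEAD	40u

/* Hypothetical stand-in for the MSS recomputation step. */
void
recompute_mss(struct conn *c, unsigned int path_mtu)
{
	assert(path_mtu > HDR_OVERHEAD);
	c->maxseg = path_mtu - HDR_OVERHEAD;
}

/*
 * Return 1 if the segment should be resent right away. The old MSS is
 * saved before the recomputation and compared afterwards, and the test
 * is "greater than" rather than "not equal": resending only makes sense
 * when the segment became smaller.
 */
int
mtudisc_should_resend(struct conn *c, unsigned int path_mtu)
{
	unsigned int orig_maxseg = c->maxseg;

	recompute_mss(c, path_mtu);
	return (orig_maxseg > c->maxseg);
}
```

Comparing before the recomputation would always see the old value on both sides, which is the failure mode the commit message describes.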


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
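The resize rule above (double the table once the entry count reaches 75% of the table size) can be sketched as below. This is only an illustration: `struct hashtable`, `table_needs_resize()` and `table_note_insert()` are hypothetical names, not the in_pcb API, and the rehashing of existing entries is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table descriptor; field names are illustrative only. */
struct hashtable {
	size_t size;	/* number of buckets */
	size_t count;	/* number of entries stored */
};

/* Return 1 once the entry count has reached 75% of the bucket count. */
int
table_needs_resize(const struct hashtable *t)
{
	return (t->count * 4 >= t->size * 3);
}

/* Account for one inserted entry and double the size when needed. */
void
table_note_insert(struct hashtable *t)
{
	t->count++;
	if (table_needs_resize(t))
		t->size *= 2;	/* a real table would also rehash its entries */
}
```

With 128 buckets the threshold is 96 entries; after doubling to 256 buckets the next resize would not trigger until 192 entries.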


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
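The two window sizes mentioned above can be worked through with a small sketch. It assumes the usual formulas, min(4*MSS, max(2*MSS, 4380)) from RFC 3390 and min(10*MSS, max(2*MSS, 14600)) from draft-ietf-tcpm-initcwnd; whether the kernel computes it exactly this way is an assumption, and `initial_window()` and its `mode` argument (mirroring the net.inet.tcp.rfc3390 values 1 and 2) are hypothetical names.

```c
#include <assert.h>

unsigned long
ul_min(unsigned long a, unsigned long b)
{
	return (a < b ? a : b);
}

unsigned long
ul_max(unsigned long a, unsigned long b)
{
	return (a > b ? a : b);
}

/*
 * Initial congestion window in bytes (illustrative only).
 * mode 1: min(4*MSS, max(2*MSS, 4380)), the older default
 * mode 2: min(10*MSS, max(2*MSS, 14600)), the new default
 */
unsigned long
initial_window(unsigned long mss, int mode)
{
	assert(mode == 1 || mode == 2);
	if (mode == 2)
		return (ul_min(10 * mss, ul_max(2 * mss, 14600)));
	return (ul_min(4 * mss, ul_max(2 * mss, 4380)));
}
```

For a typical Ethernet MSS of 1460 bytes, mode 2 yields exactly 14600 bytes, which is where the "10 * MSS or 14600 bytes" wording comes from.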


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
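The first and last items above rely on wraparound-safe sequence number arithmetic. A minimal sketch, assuming the usual BSD-style signed-difference comparison; `ack_in_window()` is a hypothetical helper, and the exact check in the kernel may differ:

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe sequence comparisons in the usual BSD style. */
#define SEQ_LT(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)
#define SEQ_GT(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)
#define SEQ_MIN(a, b)	(SEQ_LT(a, b) ? (a) : (b))
#define SEQ_MAX(a, b)	(SEQ_GT(a, b) ? (a) : (b))

/* Hypothetical helper: accept an ACK only inside [snd_una, snd_max]. */
int
ack_in_window(uint32_t th_ack, uint32_t snd_una, uint32_t snd_max)
{
	return (!SEQ_LT(th_ack, snd_una) && !SEQ_GT(th_ack, snd_max));
}
```

Because the comparison looks at the signed difference, a window that straddles the 32-bit wrap (for example snd_una = 0xfffffff0 and snd_max = 0x10) still orders correctly, which a plain `<` would not.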


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
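The varargs problem described above can be illustrated with a small, self-contained sketch. `count_ptrs()` is a hypothetical function, not anything from the kernel; it reads pointer arguments until it sees a null pointer, so the caller has to pass a sentinel that is really pointer-sized.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical variadic function: counts pointer arguments up to a
 * null pointer sentinel.
 */
int
count_ptrs(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	/* Each va_arg() reads a full pointer-sized argument. */
	for (p = first; p != NULL; p = va_arg(ap, char *))
		n++;
	va_end(ap);
	return (n);
}
```

A call such as `count_ptrs("a", "b", (void *)NULL)` passes a full-width null pointer. Where NULL is defined as a plain 0, an uncast NULL in the variadic part is passed as an int, and on a platform with 64-bit pointers the callee may then read garbage in the upper half.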


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
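
A minimal sketch of the clamp this entry describes, not the committed code: the
window field in the TCP header is 16 bits wide, so a receive window must not
exceed 65535 shifted left by the window scale. The name advertised_window and
the constants below are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WIN      65535u  /* largest value the 16-bit window field holds */
#define MAX_WINSHIFT 14u     /* largest window scale shift (RFC 1323) */

/* Hypothetical helper: value to place in the TCP header window field. */
uint16_t
advertised_window(uint32_t win, unsigned int scale)
{
	uint32_t limit;

	assert(scale <= MAX_WINSHIFT);
	limit = (uint32_t)MAX_WIN << scale;  /* biggest window expressible */
	if (win > limit)
		win = limit;                 /* clamp instead of wrapping */
	return (uint16_t)(win >> scale);
}
```

Without the clamp, the shifted value could exceed 16 bits and be truncated to a
much smaller window on the wire.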


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.185 08-Aug-2022 bluhm

To make protocol input functions MP safe, internet PCB need protection.
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@
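
The lookup-then-unref pattern described above can be sketched as follows. This
is a single-threaded toy under stated assumptions: struct toy_pcb,
toy_pcb_lookup() and toy_pcb_unref() are hypothetical names, and the table
mutex is only indicated by comments.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pcb {
	int port;    /* lookup key */
	int refcnt;  /* one reference is owned by the table itself */
};

/* Toy lookup: in the kernel this would run with the table mutex held. */
struct toy_pcb *
toy_pcb_lookup(struct toy_pcb *table, size_t n, int port)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (table[i].port == port) {
			/* Take the reference before the mutex would be released. */
			table[i].refcnt++;
			return &table[i];
		}
	}
	return NULL;
}

/* Drop the reference obtained from toy_pcb_lookup(); NULL is a no-op. */
void
toy_pcb_unref(struct toy_pcb *pcb)
{
	if (pcb == NULL)
		return;
	assert(pcb->refcnt > 0);
	pcb->refcnt--;
}
```

A caller pairs every successful lookup with exactly one unref at the end of the
function, so the entry cannot disappear while it is in use.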


Revision tags: OPENBSD_7_1_BASE
# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
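
A rough sketch of the ordering fix described above, assuming a toy connection
structure: the old segment size is saved first, the size is recomputed, and a
resend is triggered only if the size shrank. toy_update_mss() is a hypothetical
stand-in for tcp_mss(), and the resends counter stands in for the call to
tcp_output().

```c
#include <assert.h>

struct toy_tcpcb {
	unsigned int maxseg;   /* current maximum segment size */
	unsigned int resends;  /* counts immediate resend attempts */
};

/* Stand-in for tcp_mss(): derive the segment size from the path MTU. */
static void
toy_update_mss(struct toy_tcpcb *tp, unsigned int path_mtu)
{
	assert(path_mtu > 40);
	tp->maxseg = path_mtu - 40;  /* IPv4 + TCP headers, no options */
}

void
toy_mtudisc(struct toy_tcpcb *tp, unsigned int path_mtu)
{
	unsigned int orig_maxseg = tp->maxseg;  /* value before the update */

	toy_update_mss(tp, path_mtu);
	if (orig_maxseg > tp->maxseg)           /* resend only if the MSS shrank */
		tp->resends++;
}
```

Comparing before the update, as the old code did, would always see two equal
values and never resend.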


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
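
The growth rule above can be sketched as a small helper. The name
toy_next_hash_size and the exact comparison (resize once the count is at or
above three quarters of the size) are assumptions for illustration; the kernel
code may test the threshold slightly differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the table size to use after an insert brought the entry
 * count to `count`: doubled at a 75% load factor, unchanged otherwise.
 */
size_t
toy_next_hash_size(size_t count, size_t size)
{
	assert(size > 0);
	if (count * 4 >= size * 3)  /* count / size >= 3/4, without division */
		return size * 2;
	return size;
}
```

Doubling keeps the size a power of two when it starts as one, so a mask can
still be used to pick the bucket.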


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@
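
As an illustration of the relationship described above, here is a toy mapping
from routing table IDs to routing domain IDs. The array contents and the name
toy_rtable_l2() are made up for this sketch; the real lookup is rtable_l2(9).

```c
#include <assert.h>

#define TOY_RT_MAX 4

/* toy_rt_domain[t] is the routing domain that table t belongs to. */
static const unsigned int toy_rt_domain[TOY_RT_MAX] = { 0, 0, 2, 2 };

/* Hypothetical stand-in for rtable_l2(9). */
unsigned int
toy_rtable_l2(unsigned int rtableid)
{
	assert(rtableid < TOY_RT_MAX);
	return toy_rt_domain[rtableid];
}
```

In this toy setup tables 0 and 2 are themselves domains, so assigning a domain
ID to a table ID is safe, while tables 1 and 3 need the lookup to find their
domain.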


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
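
For reference, the two initial window rules mentioned above can be sketched as
below. The formulas follow RFC 3390 and the larger-window proposal in
draft-ietf-tcpm-initcwnd; the function names are hypothetical and this is not
claimed to be the kernel's own macro.

```c
#include <assert.h>

static unsigned long
ul_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long
ul_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* RFC 3390: min(4*MSS, max(2*MSS, 4380 bytes)). */
unsigned long
toy_iw_rfc3390(unsigned long mss)
{
	assert(mss > 0);
	return ul_min(4 * mss, ul_max(2 * mss, 4380));
}

/* Larger initial window: min(10*MSS, max(2*MSS, 14600 bytes)). */
unsigned long
toy_iw_10mss(unsigned long mss)
{
	assert(mss > 0);
	return ul_min(10 * mss, ul_max(2 * mss, 14600));
}
```

With a typical Ethernet MSS of 1460 bytes the second rule yields 14600 bytes,
which matches the figure quoted in the entry.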


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
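
The fixed-point bookkeeping described in items 3 and 4 can be sketched roughly as below. This is a minimal illustration, not the kernel's tcp_xmit_timer(): the names rtt_state, rtt_update, rtt_rto_ticks, SRTT_FRAC_BITS and RTTVAR_FRAC_BITS are hypothetical, and the gains of 1/8 and 1/4 are the classic Jacobson values, assumed here rather than taken from the commit.

```c
/* Assumed scaling, matching the comment update in the commit message:
 * the smoothed RTT keeps 5 bits of fraction, the variance keeps 4. */
#define SRTT_FRAC_BITS   5
#define RTTVAR_FRAC_BITS 4

/* Hypothetical state holder; the kernel keeps these fields in the tcpcb. */
struct rtt_state {
	int srtt;	/* smoothed RTT, ticks << SRTT_FRAC_BITS */
	int rttvar;	/* RTT variance, ticks << RTTVAR_FRAC_BITS */
};

/* Feed one RTT sample, given in whole ticks, into the estimator. */
void
rtt_update(struct rtt_state *s, int rtt_ticks)
{
	int err, err_var;

	if (s->srtt == 0) {
		/* First sample: srtt = rtt, rttvar = rtt / 2. */
		s->srtt = rtt_ticks << SRTT_FRAC_BITS;
		s->rttvar = (rtt_ticks << RTTVAR_FRAC_BITS) / 2;
		return;
	}

	/* Error between the sample and srtt, in srtt units. */
	err = (rtt_ticks << SRTT_FRAC_BITS) - s->srtt;
	s->srtt += err / 8;		/* gain 1/8 */

	if (err < 0)
		err = -err;
	/* The "missing shift": convert |err| to the rttvar scale. */
	err_var = err >> (SRTT_FRAC_BITS - RTTVAR_FRAC_BITS);
	s->rttvar += (err_var - s->rttvar) / 4;	/* gain 1/4 */
}

/* Retransmit timeout in whole ticks: srtt + 4 * rttvar. */
int
rtt_rto_ticks(const struct rtt_state *s)
{
	return (s->srtt >> SRTT_FRAC_BITS) +
	    ((4 * s->rttvar) >> RTTVAR_FRAC_BITS);
}
```

Mixing the two scales without the shift in rtt_update() is the kind of error item 3 refers to: the value would be off by a factor of two rather than visibly wrong.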


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
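
The varargs problem can be illustrated with a small sketch. The function count_strings is hypothetical and only stands in for any variadic callee that reads pointer arguments; it is not code from this file. When NULL is defined as a plain integer 0, an unadorned NULL in the variable part is passed as an int, so the callee may read garbage in the upper half of a 64-bit pointer slot. The cast makes the argument a real pointer.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: counts string arguments up to a null-pointer
 * sentinel. The sentinel is fetched with va_arg() as a pointer, so the
 * caller has to pass a pointer-sized value.
 */
int
count_strings(const char *first, ...)
{
	va_list ap;
	const char *s;
	int n = 0;

	va_start(ap, first);
	for (s = first; s != NULL; s = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A caller would write count_strings("a", "b", (void *)NULL); leaving out the cast is the portability bug the commit describes.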


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.184 02-Mar-2022 bluhm

The return value of in6_pcbnotify() is never used. Make it a void
function.
OK gnezdo@ mvs@ florian@ sashan@


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
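
The growth rule can be sketched as below. This only tracks the sizes; the struct table_stats and both functions are hypothetical names, and the real change also has to rehash every PCB into the larger table, which is omitted here.

```c
#include <stddef.h>

/* Hypothetical bookkeeping for a hash table: bucket count and entries. */
struct table_stats {
	size_t size;	/* number of hash buckets */
	size_t count;	/* number of entries stored */
};

/* True once the entry count has reached 75% of the table size. */
int
table_needs_resize(const struct table_stats *t)
{
	/* count / size >= 3/4, written without division. */
	return t->count * 4 >= t->size * 3;
}

/* Account for one inserted entry and double the table if needed. */
void
table_note_insert(struct table_stats *t)
{
	t->count++;
	if (table_needs_resize(t))
		t->size *= 2;	/* a real table would rehash here */
}
```

With a 128-bucket table, the doubling in this sketch happens on the insert that brings the count to 96.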


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
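
The two settings can be compared with a short sketch. The formulas are the ones published in RFC 3390 and in the initcwnd draft (later RFC 6928); the function initial_window, its helpers, and the treatment of mode 0 as a single segment are assumptions made for illustration, not the code from this revision.

```c
static unsigned long
ul_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long
ul_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical helper: initial congestion window in bytes for a given
 * MSS, selected by the value of net.inet.tcp.rfc3390.
 */
unsigned long
initial_window(unsigned long mss, int rfc3390_mode)
{
	switch (rfc3390_mode) {
	case 2:
		/* initcwnd draft: min(10*MSS, max(2*MSS, 14600)) */
		return ul_min(10 * mss, ul_max(2 * mss, 14600));
	case 1:
		/* RFC 3390: min(4*MSS, max(2*MSS, 4380)) */
		return ul_min(4 * mss, ul_max(2 * mss, 4380));
	default:
		/* Assumption: one segment when the option is off. */
		return mss;
	}
}
```

For a 1460-byte MSS the new default yields 14600 bytes, which is where the figure in the commit message comes from. Under RFC 3390 the same MSS gives 4380 bytes, because the byte cap applies before the 4*MSS limit is reached.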


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't
any need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers
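
The RFC 1948 idea referred to above is ISN = clock + F(connection 4-tuple, secret): monotonic per connection, unpredictable across connections. A rough sketch follows, under loud assumptions: isn_rfc1948() and mix() are hypothetical names, addresses are IPv4 only, and FNV-1a is used purely as a stand-in for the keyed cryptographic hash a real stack needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in mixing function (FNV-1a). NOT cryptographically secure. */
static uint32_t
mix(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len-- > 0) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/* Hypothetical helper: per-connection offset added to a running clock. */
static uint32_t
isn_rfc1948(uint32_t clock, uint32_t laddr, uint16_t lport,
    uint32_t faddr, uint16_t fport, const unsigned char secret[16])
{
	uint32_t h = 2166136261u;

	h = mix(h, secret, 16);
	h = mix(h, &laddr, sizeof(laddr));
	h = mix(h, &lport, sizeof(lport));
	h = mix(h, &faddr, sizeof(faddr));
	h = mix(h, &fport, sizeof(fport));
	return clock + h;	/* wraps modulo 2^32 like sequence numbers */
}
```

For a fixed 4-tuple the result advances exactly with the clock, which is what makes time-wait recycling on port reuse possible.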


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
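
Sequence numbers wrap modulo 2^32, so the th_ack check has to use wrap-aware comparisons. The sketch below shows the usual BSD-style macros; ack_in_window() is a hypothetical helper, and the macro bodies are written from memory of the common idiom rather than copied from this revision.

```c
#include <stdint.h>

/* Wrap-aware comparisons; arguments must be 32-bit unsigned values. */
#define SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)
#define SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)
#define SEQ_MIN(a, b)	(SEQ_LT(a, b) ? (a) : (b))
#define SEQ_MAX(a, b)	(SEQ_GT(a, b) ? (a) : (b))

/* Hypothetical helper: accept an ACK only if snd_una <= ack <= snd_max. */
static int
ack_in_window(uint32_t ack, uint32_t snd_una, uint32_t snd_max)
{
	return !SEQ_LT(ack, snd_una) && !SEQ_GT(ack, snd_max);
}
```

An ACK of 5 is accepted when snd_una is 0xfffffff0 and snd_max is 0x10, because the window straddles the wrap point.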


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
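
The varargs problem can be illustrated in user space. In this sketch count_args() is a hypothetical function; the point is only that a bare 0 passed through "..." is an int, while (void *)NULL is guaranteed to be pointer-sized.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical example: count string arguments up to a terminating
 * null pointer. The callee reads every variadic argument as a pointer,
 * so the terminator must be passed with pointer width.
 */
static int
count_args(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A call such as count_args("a", "b", (void *)NULL) is safe on both 32 and 64 bit systems; without the cast, a NULL that is defined as plain 0 may only fill part of the argument slot.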


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>
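
Choosing a receive window scale for a given buffer size is usually done with a small loop like the one below. This is a sketch of the common technique, not the code from this revision; window_scale() is a hypothetical name, while the two constants carry the values from RFC 1323.

```c
#include <stdint.h>

#define TCP_MAXWIN		65535	/* largest unscaled window */
#define TCP_MAX_WINSHIFT	14	/* largest permitted shift */

/* Smallest shift that lets a 16-bit window field cover bufsize bytes. */
static int
window_scale(uint32_t bufsize)
{
	int scale = 0;

	while (scale < TCP_MAX_WINSHIFT &&
	    ((uint32_t)TCP_MAXWIN << scale) < bufsize)
		scale++;
	return scale;
}
```

When the route carries a recvpipe value, that value would be passed as bufsize instead of the socket buffer default.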


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
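
The overflow concerns the 16-bit window field in the TCP header. A minimal sketch of the clamp, assuming a hypothetical helper advertised_window() that takes the window in bytes and the receive scale:

```c
#include <stdint.h>

/*
 * Hypothetical helper: value to place in the 16-bit th_win field.
 * The scaled window is clamped so it can never wrap around.
 */
static uint16_t
advertised_window(uint32_t win, int rcv_scale)
{
	uint32_t scaled = win >> rcv_scale;

	if (scaled > 65535)
		scaled = 65535;
	return (uint16_t)scaled;
}
```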


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.183 02-Jan-2022 jsg

spelling
ok jmc@ reads ok tb@


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
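
The changed comparison can be sketched in isolation. should_resend() and its parameters are hypothetical names; the real logic lives inside tcp_mtudisc() and works on the tcpcb.

```c
/*
 * Hypothetical helper: decide whether to retransmit immediately after
 * the maximum segment size has been recomputed. The old value must be
 * saved before the recomputation; resend only if the segment shrank.
 */
static int
should_resend(unsigned int old_maxseg, unsigned int new_maxseg)
{
	return old_maxseg > new_maxseg;
}
```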


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@
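
A type safe conversion function of this kind might look like the sketch below. The body is an assumption based on the description above, not a copy of the kernel header.

```c
#include <sys/socket.h>

/*
 * Unlike a bare cast, the inline function only accepts a pointer to
 * struct sockaddr_storage, so passing the wrong type is diagnosed by
 * the compiler.
 */
static inline struct sockaddr *
sstosa(struct sockaddr_storage *ss)
{
	return (struct sockaddr *)ss;
}
```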


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of
per-ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@
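
The expansion replaces an in-place macro with an explicit assignment. In this sketch the macro body is reproduced from memory of the traditional BSD definition, and the two helper functions are hypothetical.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Traditional in-place form; some systems still provide it. */
#ifndef NTOHL
#define NTOHL(x)	((x) = ntohl((uint32_t)(x)))
#endif

/* Old style: the macro hides the assignment. */
static uint32_t
seq_old_style(uint32_t wire)
{
	NTOHL(wire);
	return wire;
}

/* Expanded style: the assignment is visible at the call site. */
static uint32_t
seq_new_style(uint32_t wire)
{
	wire = ntohl(wire);
	return wire;
}
```

Both functions return the same host-order value; the expanded form simply makes the side effect obvious to the reader.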


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
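
A minimal sketch of the 75% growth rule described above, using a hypothetical `struct table_stats` in place of the real PCB table. Rehashing the entries into the new buckets is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a hash table; not the kernel's structure. */
struct table_stats {
	size_t size;	/* number of buckets */
	size_t count;	/* number of stored entries */
};

/* Nonzero once the entry count reaches 75% of the table size. */
static int
table_needs_resize(const struct table_stats *t)
{
	/* count >= size * 3/4, written without fractions */
	return t->count * 4 >= t->size * 3;
}

/*
 * Record one insertion and double the bucket count when the threshold
 * is reached. A real table would also rehash every entry here.
 */
static void
table_note_insert(struct table_stats *t)
{
	assert(t->size > 0);
	t->count++;
	if (table_needs_resize(t))
		t->size *= 2;
}
```

With 128 buckets, the 96th entry triggers the doubling to 256.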


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
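
The two window sizes can be compared with a small sketch. It assumes the formula from RFC 3390 and the formula in which the initcwnd draft was later published; the function names are hypothetical and the mapping to the sysctl values is not modeled.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static uint32_t
max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* RFC 3390: min(4*MSS, max(2*MSS, 4380 bytes)). */
static uint32_t
iw_rfc3390(uint32_t mss)
{
	assert(mss > 0 && mss <= 65535);
	return min_u32(4 * mss, max_u32(2 * mss, 4380));
}

/* Larger initial window: min(10*MSS, max(2*MSS, 14600 bytes)). */
static uint32_t
iw_initcwnd10(uint32_t mss)
{
	assert(mss > 0 && mss <= 65535);
	return min_u32(10 * mss, max_u32(2 * mss, 14600));
}
```

For the common Ethernet MSS of 1460 bytes, the first function yields 4380 bytes and the second 14600 bytes, which matches the "10 * MSS or 14600 bytes" wording above.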


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers
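
RFC 1948 derives the ISN as a clock value plus a keyed hash of the connection identifiers, so each connection tuple sees a monotonic sequence. The sketch below shows only that structure. `toy_mix()` is a non-cryptographic stand-in (FNV-1a style) chosen to keep the example short, and `struct conn_id` and `isn_rfc1948()` are hypothetical names; a real implementation needs a cryptographic hash and a properly random secret.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection identifier (IPv4 only, for brevity). */
struct conn_id {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* NOT cryptographic: placeholder for the keyed hash F() in RFC 1948. */
static uint32_t
toy_mix(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len-- > 0) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/* ISN = M + F(secret, laddr, lport, faddr, fport), with M a clock value. */
static uint32_t
isn_rfc1948(const struct conn_id *id, uint32_t secret, uint32_t clock)
{
	uint32_t h = 2166136261u;

	h = toy_mix(h, &secret, sizeof(secret));
	h = toy_mix(h, &id->laddr, sizeof(id->laddr));
	h = toy_mix(h, &id->lport, sizeof(id->lport));
	h = toy_mix(h, &id->faddr, sizeof(id->faddr));
	h = toy_mix(h, &id->fport, sizeof(id->fport));
	return clock + h;	/* wraps modulo 2^32, like sequence numbers */
}
```

For a fixed tuple and secret the hash term is constant, so the ISN advances exactly with the clock. Without the secret, an observer of one tuple should learn nothing useful about another tuple's offset, which is why the toy hash must not be used in practice.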


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
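
Checking th_ack against snd_una/snd_max has to respect 32-bit wraparound of sequence numbers. A minimal sketch of wrap-safe comparisons follows. The lowercase helper names and `ack_in_window()` are hypothetical; they mirror the idea behind SEQ_MIN/SEQ_MAX and are not the kernel's macros.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t seq_t;	/* hypothetical name for a TCP sequence number */

/*
 * Compare in modular arithmetic: the sign of the 32-bit difference tells
 * which value is "earlier". Assumes two's complement conversion.
 */
static int
seq_lt(seq_t a, seq_t b)
{
	return (int32_t)(a - b) < 0;
}

static int
seq_gt(seq_t a, seq_t b)
{
	return (int32_t)(a - b) > 0;
}

static seq_t
seq_min(seq_t a, seq_t b)
{
	return seq_lt(a, b) ? a : b;
}

static seq_t
seq_max(seq_t a, seq_t b)
{
	return seq_gt(a, b) ? a : b;
}

/* An ACK is plausible only if snd_una <= ack <= snd_max in sequence space. */
static int
ack_in_window(seq_t ack, seq_t snd_una, seq_t snd_max)
{
	return !seq_lt(ack, snd_una) && !seq_gt(ack, snd_max);
}
```

A plain `<` would misorder 0xfffffff0 and 0x10 after the counter wraps; the signed-difference form treats 0x10 as the later value.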


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
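
Keeping the smoothed RTT with fraction bits means the stored value is the RTT multiplied by a power of two, so small corrections are not lost to integer truncation. The sketch below uses 5 fraction bits as stated in item 4. The gain of 1/8 follows the classic Jacobson estimator and is an assumption here, as are the names `SRTT_FRAC_BITS`, `srtt_update()` and `srtt_ticks()`.

```c
#include <assert.h>
#include <stdint.h>

#define SRTT_FRAC_BITS	5	/* hypothetical name: 5 bits of fraction */

/*
 * Update a fixed-point smoothed RTT with a new sample given in ticks.
 * srtt_fp holds srtt * 2^SRTT_FRAC_BITS; the new value is
 * srtt + (sample - srtt) / 8, computed in the scaled domain.
 */
static int32_t
srtt_update(int32_t srtt_fp, int32_t rtt_sample)
{
	int32_t delta;

	assert(rtt_sample >= 0 && rtt_sample < (1 << 20));
	delta = rtt_sample * (1 << SRTT_FRAC_BITS) - srtt_fp;
	return srtt_fp + delta / 8;
}

/* Convert the fixed-point value back to whole ticks. */
static int32_t
srtt_ticks(int32_t srtt_fp)
{
	return srtt_fp >> SRTT_FRAC_BITS;
}
```

Forgetting the shift when reading the value back, the kind of omission item 3 refers to, would make the RTT appear 32 times larger than it is.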


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
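
The varargs problem can be reproduced in plain C. When NULL is defined as a bare 0, a variadic call passes it as an int, and va_arg() reading a pointer may then fetch garbage in the upper bits on 64-bit systems. A minimal sketch with a hypothetical helper, `count_args()`:

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical variadic helper: counts string arguments up to a
 * null-pointer sentinel.
 */
static int
count_args(const char *first, ...)
{
	va_list ap;
	const char *s;
	int n = 0;

	va_start(ap, first);
	for (s = first; s != NULL; s = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A portable call site spells the sentinel with an explicit pointer type, for example `count_args("a", "b", (void *)NULL)`, so that a full-width pointer is passed regardless of how NULL is defined.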


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoid traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>
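
Choosing a window scale amounts to finding the smallest shift that lets a 16-bit window field cover the receive buffer. The sketch below uses the limits from the TCP window scale option (65535-byte unscaled window, maximum shift of 14). The constant and function names are hypothetical, and the buffer size stands in for whatever value the route's recvpipe setting selects.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_UNSCALED_WIN	65535u	/* largest value of the 16-bit field */
#define MAX_WIN_SHIFT		14u	/* largest shift the option allows */

/* Smallest shift such that (65535 << shift) covers bufsize, capped at 14. */
static unsigned int
window_scale_for(uint32_t bufsize)
{
	unsigned int scale = 0;

	while (scale < MAX_WIN_SHIFT &&
	    (MAX_UNSCALED_WIN << scale) < bufsize)
		scale++;
	assert(scale <= MAX_WIN_SHIFT);
	return scale;
}
```

A 256 KB buffer needs a shift of 3, because 65535 << 2 is four bytes short of 262144. The scale has to be recomputed whenever the buffer size it was derived from changes.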


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
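
The overflow guard can be illustrated as follows: after shifting the real window by the scale factor, the result must still fit the 16-bit header field, so larger values are clamped. This is a sketch with a hypothetical function name, not the code from the fix.

```c
#include <assert.h>
#include <stdint.h>

/* Value to place in the 16-bit window field for a given window and scale. */
static uint16_t
advertised_window(uint32_t win, unsigned int scale)
{
	uint32_t scaled;

	assert(scale <= 14);	/* maximum shift allowed by the option */
	scaled = win >> scale;
	if (scaled > 65535)
		scaled = 65535;	/* clamp instead of silently truncating */
	return (uint16_t)scaled;
}
```

Without the clamp, a 70000-byte window with scale 0 would be truncated to 4464 by the cast, advertising far less space than is available.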


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.182 11-Nov-2021 bluhm

Do not call ip_deliver() recursively from IPsec. As there is no
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@
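
As an illustration of the type-safe conversion described above, a minimal sketch follows. The helper name example_sstosa() and its body are assumptions made for this example; the real sstosa() lives in the kernel headers and may differ in detail.

```c
#include <sys/socket.h>

/*
 * Sketch of a type-safe conversion helper. Unlike a bare cast at the
 * call site, the compiler rejects any argument that is not a
 * struct sockaddr_storage pointer.
 * example_sstosa() is a hypothetical name used for illustration only.
 */
static inline struct sockaddr *
example_sstosa(struct sockaddr_storage *ss)
{
	return ((struct sockaddr *)ss);
}
```

A caller then writes example_sstosa(&ss) instead of (struct sockaddr *)&ss, so passing the wrong pointer type becomes a compile-time diagnostic.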


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
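
The growth rule described above can be sketched as follows. The function name and the integer formulation are assumptions for illustration; the kernel's actual resize code is not shown in this log.

```c
#include <stddef.h>

/*
 * Sketch of the "double at 75% load" rule.
 * resized_table_size() is a hypothetical helper: given the number of
 * hashed entries and the current number of buckets, it returns the
 * bucket count the table should have.
 */
static size_t
resized_table_size(size_t count, size_t size)
{
	/* count >= 0.75 * size, written without floating point */
	if (count * 4 >= size * 3)
		return (size * 2);
	return (size);
}
```

With 128 buckets the table stays at 128 for up to 95 entries and doubles to 256 once the 96th entry is present.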


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
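
For reference, the two initial window sizes mentioned above can be sketched with the formulas published in RFC 3390 and in the initcwnd proposal (later RFC 6928). The function names are hypothetical and the code is not taken from the kernel; it only shows how the 4*MSS and 10*MSS upper bounds combine with the byte limits.

```c
/* Small helpers, local to this sketch. */
static unsigned int
umin(unsigned int a, unsigned int b)
{
	return (a < b ? a : b);
}

static unsigned int
umax(unsigned int a, unsigned int b)
{
	return (a > b ? a : b);
}

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)) */
static unsigned int
initial_window_rfc3390(unsigned int mss)
{
	return (umin(4 * mss, umax(2 * mss, 4380)));
}

/* initcwnd proposal: IW = min(10*MSS, max(2*MSS, 14600 bytes)) */
static unsigned int
initial_window_10mss(unsigned int mss)
{
	return (umin(10 * mss, umax(2 * mss, 14600)));
}
```

For an MSS of 1460 bytes these formulas give 4380 and 14600 bytes respectively; for a small MSS such as 536 the 4*MSS and 10*MSS bounds apply instead.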


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
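
The wraparound-safe comparisons behind SEQ_MIN/SEQ_MAX and the th_ack check can be sketched as below. The kernel uses macros; these inline functions and the helper ack_in_window() are assumptions written for illustration, using the usual BSD trick of comparing the signed 32-bit difference.

```c
#include <stdint.h>

/* a is "before" b in sequence space if the signed difference is negative. */
static inline int
seq_lt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) < 0);
}

static inline int
seq_gt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) > 0);
}

static inline uint32_t
seq_min(uint32_t a, uint32_t b)
{
	return (seq_lt(a, b) ? a : b);
}

static inline uint32_t
seq_max(uint32_t a, uint32_t b)
{
	return (seq_gt(a, b) ? a : b);
}

/*
 * Hypothetical check: an ACK is plausible only if it lies within
 * [snd_una, snd_max] in sequence space.
 */
static inline int
ack_in_window(uint32_t snd_una, uint32_t snd_max, uint32_t ack)
{
	return (!seq_lt(ack, snd_una) && !seq_gt(ack, snd_max));
}
```

A plain unsigned comparison would order 0xfffffff0 after 0x10; the signed-difference form treats 0x10 as the later sequence number, which is what TCP needs across a wrap.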


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
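
The varargs pitfall described above can be shown with a small userland sketch. The function count_ptr_args() is hypothetical and unrelated to the kernel code touched by this revision; it only demonstrates why a sentinel passed through "..." needs an explicit pointer cast.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Count the string arguments that precede a null-pointer sentinel.
 * Arguments passed through "..." get no prototype-driven conversion,
 * so a bare NULL (or 0) may be passed as an int that is narrower than
 * a pointer on 64-bit systems. Callers must therefore cast the sentinel.
 */
static int
count_ptr_args(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	for (p = first; p != NULL; p = va_arg(ap, char *))
		n++;
	va_end(ap);
	return (n);
}
```

A correct call looks like count_ptr_args("a", "b", (char *)NULL); without the cast the sentinel may be read back with garbage in its upper bits.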


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
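
A sketch of the constraint mentioned above: the window field in the TCP header is 16 bits wide, so the advertised window must not exceed 65535 shifted left by the receive window scale. The names below are assumptions for illustration and the code is not the actual fix from this revision.

```c
#include <stdint.h>

#define MAX_UNSCALED_WIN	65535u	/* largest value of the 16-bit field */
#define MAX_WIN_SHIFT		14u	/* largest permitted window scale shift */

/* Limit a window in bytes to what the scaled 16-bit field can express. */
static uint32_t
clamp_window(uint32_t win, unsigned int scale)
{
	uint32_t limit;

	if (scale > MAX_WIN_SHIFT)
		scale = MAX_WIN_SHIFT;
	limit = (uint32_t)MAX_UNSCALED_WIN << scale;
	return (win < limit ? win : limit);
}

/* Value that would be written into the 16-bit header field. */
static uint16_t
window_field(uint32_t win, unsigned int scale)
{
	if (scale > MAX_WIN_SHIFT)
		scale = MAX_WIN_SHIFT;
	return ((uint16_t)(clamp_window(win, scale) >> scale));
}
```

Without the clamp, a 100000 byte window with a scale of 0 would be truncated when stored in 16 bits; with it, the field carries 65535.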


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.181 23-Oct-2021 bluhm

There is an m_pullup() down in AH input. As it may free or change
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it is never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
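
The changed comparison can be sketched as below. This is a standalone
illustration, not the kernel code; should_resend() and its parameter names
are hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not the kernel function: decide whether a
 * segment should be resent right after path MTU discovery.
 * old_mss is the value saved before the MSS was recomputed,
 * new_mss is the value after recomputation.
 */
static bool
should_resend(unsigned int old_mss, unsigned int new_mss)
{
	/* "greater than" instead of "not equal": resend only if it shrank. */
	return old_mss > new_mss;
}
```

An MSS that grew or stayed the same gives no reason to resend immediately,
so only the shrinking case returns true.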


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
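
The resize rule can be sketched as below. This is a minimal illustration with
hypothetical names (table_needs_resize(), table_next_size()), not the in_pcb
code itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true once count reaches 75% of the table size. */
static bool
table_needs_resize(size_t count, size_t size)
{
	/* count >= size * 3 / 4, written without division. */
	return count * 4 >= size * 3;
}

/* Hypothetical helper: double the table size once the threshold is hit. */
static size_t
table_next_size(size_t count, size_t size)
{
	return table_needs_resize(count, size) ? size * 2 : size;
}
```

With a table of 128 slots, 96 entries is the first count that triggers the
doubling to 256.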


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
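
As a rough sketch of the two settings, assuming the formulas from RFC 3390
and from the initcwnd draft (later RFC 6928) rather than the exact kernel
expression, which the log does not show. initial_window() and its mode
argument are hypothetical; mode stands in for the net.inet.tcp.rfc3390
values 1 and 2 named above, and other values are not modeled.

```c
#include <stdint.h>

static uint32_t
min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static uint32_t
max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* Hypothetical helper: initial window in bytes for mode 1 or 2. */
static uint32_t
initial_window(int mode, uint32_t mss)
{
	if (mode == 2) {
		/* initcwnd draft: min(10*MSS, max(2*MSS, 14600)) */
		return min_u32(10 * mss, max_u32(2 * mss, 14600));
	}
	/* RFC 3390: min(4*MSS, max(2*MSS, 4380)) */
	return min_u32(4 * mss, max_u32(2 * mss, 4380));
}
```

With an MSS of 1460 the larger setting gives exactly 14600 bytes, which is
where the "10 * MSS or 14600 bytes" wording comes from. The 4*MSS and 10*MSS
figures are upper bounds in these formulas.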


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered, still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
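
For context, sequence-number minimum and maximum have to be computed modulo
2^32, so a plain integer comparison is not enough. The sketch below shows the
usual BSD-style idiom with hypothetical lowercase names. It is not a copy of
the kernel's SEQ_MIN/SEQ_MAX macros.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * a is "before" b if the 32-bit difference, read as signed, is negative.
 * The unsigned-to-signed conversion is assumed to wrap, as it does on
 * common compilers.
 */
static bool
seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static uint32_t
seq_min(uint32_t a, uint32_t b)
{
	return seq_lt(a, b) ? a : b;
}

static uint32_t
seq_max(uint32_t a, uint32_t b)
{
	return seq_lt(a, b) ? b : a;
}
```

A sequence number just below the wrap point therefore counts as smaller than
one just past it, although its numeric value is larger.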


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
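
A small userland sketch of the issue, with a hypothetical variadic function
count_args(). It is not taken from the kernel. If NULL is defined as a plain
integer 0, passing it bare through "..." may push only an int, so the
sentinel is cast to a pointer explicitly.

```c
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical helper: count string arguments up to a NULL sentinel. */
static int
count_args(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	/* Each variadic argument is read back as a full-width pointer. */
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A call such as count_args("a", "b", (void *)NULL) passes a pointer-sized
sentinel regardless of how NULL is defined.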


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
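
The overflow described above can be illustrated with a minimal sketch. It is not the kernel's code: the function name `advertised_window` and the `MAX_WIN` constant are hypothetical, and the only facts assumed are that the TCP window field is 16 bits wide and that the window scale shift is at most 14.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WIN 65535u	/* largest value a 16-bit window field can hold */

/*
 * Hypothetical helper: return the value to place in the 16-bit window
 * field for a receive window of `win` bytes and a shift of `scale`.
 * The scaled value is clamped so it never wraps when truncated.
 */
uint16_t
advertised_window(uint32_t win, unsigned int scale)
{
	uint32_t scaled;

	assert(scale <= 14);	/* assumed upper bound on the shift */
	scaled = win >> scale;
	if (scaled > MAX_WIN)
		scaled = MAX_WIN;
	return (uint16_t)scaled;
}
```

Without the clamp, a 1 MB window with a shift of 4 would scale to 65536 and truncate to 0 in the header.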


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.180 13-Oct-2021 bluhm

The function ipip_output() was registered as .xf_output() xform
function. But it was never called via this pointer. It would have
immediately crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroy/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi
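
As an illustration of what "including the timestamp option" means on the wire, here is a minimal sketch of the commonly used 12-byte layout: two NOP pads, then kind 8, length 10, TSval and TSecr in network byte order. The function and macro names are hypothetical and this is not the code from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPT_NOP			1	/* no-operation pad byte */
#define OPT_TIMESTAMP		8	/* timestamp option kind */
#define OPT_TIMESTAMP_LEN	10	/* kind + len + TSval + TSecr */
#define TS_PADDED_LEN		12	/* option plus two NOP pads */

/* Store a 32-bit value in network (big-endian) byte order. */
static void
put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/*
 * Hypothetical helper: write a padded timestamp option into buf,
 * which must have room for TS_PADDED_LEN bytes. Returns the number
 * of bytes written.
 */
size_t
write_ts_option(uint8_t *buf, uint32_t tsval, uint32_t tsecr)
{
	assert(buf != NULL);
	buf[0] = OPT_NOP;
	buf[1] = OPT_NOP;
	buf[2] = OPT_TIMESTAMP;
	buf[3] = OPT_TIMESTAMP_LEN;
	put_be32(buf + 4, tsval);
	put_be32(buf + 8, tsecr);
	return TS_PADDED_LEN;
}
```

A keep alive is an ordinary ACK segment, so once the option has been negotiated it would carry these same 12 bytes of options.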


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
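
The resize rule stated above (double the table once the entry count reaches 75% of its size) can be sketched as follows. The helper names are hypothetical and the sketch leaves out rehashing entirely; it only shows the threshold arithmetic in integer form.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: nonzero once `count` entries have reached 75%
 * of a table with `size` buckets. Written as count*4 >= size*3 to
 * avoid floating point.
 */
int
table_needs_resize(size_t count, size_t size)
{
	assert(size > 0);
	return count * 4 >= size * 3;
}

/* Hypothetical helper: table size after applying the doubling rule once. */
size_t
table_next_size(size_t count, size_t size)
{
	return table_needs_resize(count, size) ? size * 2 : size;
}
```

For a 128-bucket table the threshold falls at 96 entries, at which point the next size would be 256.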


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
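
For reference, the two initial window sizes mentioned above can be computed as in the sketch below. The formulas are the ones given in RFC 3390 and in draft-ietf-tcpm-initcwnd as I understand them; the function names are hypothetical and this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Smaller and larger of two unsigned values. */
static inline uint32_t
min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static inline uint32_t
max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)). */
uint32_t
iw_rfc3390(uint32_t mss)
{
	assert(mss > 0);
	return min_u32(4 * mss, max_u32(2 * mss, 4380));
}

/* draft-ietf-tcpm-initcwnd: IW = min(10*MSS, max(2*MSS, 14600 bytes)). */
uint32_t
iw_initcwnd10(uint32_t mss)
{
	assert(mss > 0);
	return min_u32(10 * mss, max_u32(2 * mss, 14600));
}
```

With a typical Ethernet MSS of 1460 bytes, the larger window works out to exactly 10 segments, which is where the 14600 figure comes from.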


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
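
The first item above concerns comparing an acknowledgment number against the send window. A minimal sketch of wraparound-safe sequence comparison, in the style of the traditional BSD SEQ_LT()/SEQ_GT() macros, is shown below. The function names are hypothetical, the exact bounds the commit checks are an assumption, and the signed cast relies on two's complement conversion as found on common platforms.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if sequence number a precedes b, modulo 2^32. */
int
seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Nonzero if sequence number a follows b, modulo 2^32. */
int
seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/*
 * Hypothetical check: an ACK is plausible only if it lies within
 * [snd_una, snd_max], the range of data sent but not yet acknowledged.
 */
int
ack_in_window(uint32_t ack, uint32_t snd_una, uint32_t snd_max)
{
	assert(!seq_gt(snd_una, snd_max));
	return !seq_lt(ack, snd_una) && !seq_gt(ack, snd_max);
}
```

Plain `<` and `>` would give the wrong answer when the window straddles the 32-bit wrap point, which is why the subtraction is done first.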


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
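
A minimal sketch of the varargs issue described in 1.65, not taken from the
tree: count_args() is a hypothetical helper that walks pointer arguments up
to a null-pointer sentinel. Where NULL is defined as a plain integer 0, an
uncast NULL in the variable part may be passed as an int rather than a full
64-bit pointer, so the sentinel is cast explicitly.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper (illustration only): counts the string arguments
 * that precede a null-pointer sentinel.
 */
static int
count_args(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	/* Each variadic argument is read back as a full-width pointer. */
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A caller would write count_args("a", "b", (void *)NULL), so the terminator
is always passed with pointer width.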


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
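
A minimal sketch of the overflow guard described in 1.24, assuming the
usual 16-bit window field and a window scale shift; the names
SKETCH_TCP_MAXWIN and sketch_window_field() are hypothetical and not from
the tree. The window is clamped before scaling so the shifted value always
fits in 16 bits.

```c
#include <stdint.h>

/* Largest value the 16-bit TCP window field can carry. */
#define SKETCH_TCP_MAXWIN	65535UL

/*
 * Hypothetical helper (illustration only): returns the value to place in
 * the TCP header window field for a receive window of win bytes and a
 * window scale shift of scale (0..14).
 */
static uint16_t
sketch_window_field(unsigned long win, unsigned int scale)
{
	/* Clamp first, otherwise the cast below would silently truncate. */
	if (win > (SKETCH_TCP_MAXWIN << scale))
		win = SKETCH_TCP_MAXWIN << scale;
	return (uint16_t)(win >> scale);
}
```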


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.179 14-Jul-2021 bluhm

Resend the TCP packet only if the MTU locked flag appears at the
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested Matthias Schmidt; tested and OK tobhe@


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
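
A minimal sketch of the ordering described in 1.177, under stated
assumptions: struct sketch_conn, sketch_update_mss() and sketch_mtudisc()
are hypothetical stand-ins, and the 40-byte header allowance is an
illustrative simplification, not the real tcp_mss() logic. The old segment
size is saved, the size is recomputed, and only then are the two compared
with "greater than".

```c
/* Hypothetical connection state (illustration only). */
struct sketch_conn {
	unsigned int maxseg;	/* current maximum segment size */
};

/* Stand-in for the MSS recomputation: path MTU minus 40 header bytes. */
static void
sketch_update_mss(struct sketch_conn *c, unsigned int path_mtu)
{
	c->maxseg = path_mtu > 40 ? path_mtu - 40 : 0;
}

/*
 * Returns 1 if a packet should be resent immediately, i.e. only when the
 * segment size became smaller after the update.
 */
static int
sketch_mtudisc(struct sketch_conn *c, unsigned int path_mtu)
{
	unsigned int orig_maxseg = c->maxseg;

	sketch_update_mss(c, path_mtu);
	/* Compare after the update; "!=" would also fire on growth. */
	return orig_maxseg > c->maxseg;
}
```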


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
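
A minimal sketch of the growth rule described in 1.131, assuming a simple
entry count and table size; sketch_table_size() is a hypothetical helper
and not the in_pcb code. The table size doubles once the number of entries
reaches 75% of it.

```c
/*
 * Hypothetical helper (illustration only): returns the table size to use
 * for a hash table of the given size holding count entries.
 */
static unsigned int
sketch_table_size(unsigned int size, unsigned int count)
{
	/* count >= 0.75 * size, kept in integer arithmetic. */
	if ((unsigned long long)count * 4 >= (unsigned long long)size * 3)
		return size * 2;
	return size;
}
```

A real resize would also have to rehash every entry into the new table;
that step is left out here.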


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
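
A minimal sketch of the two initial window sizes mentioned in 1.113. The
formulas are the ones given in RFC 3390 and in RFC 6928 (the published
form of draft-ietf-tcpm-initcwnd); the function names are hypothetical and
the code is not copied from the tree.

```c
static unsigned long
sketch_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long
sketch_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* RFC 3390: min(4*MSS, max(2*MSS, 4380 bytes)). */
static unsigned long
sketch_iw_rfc3390(unsigned long mss)
{
	return sketch_min(4 * mss, sketch_max(2 * mss, 4380));
}

/* RFC 6928: min(10*MSS, max(2*MSS, 14600 bytes)). */
static unsigned long
sketch_iw10(unsigned long mss)
{
	return sketch_min(10 * mss, sketch_max(2 * mss, 14600));
}
```

With an MSS of 1460 the larger window comes out at exactly 14600 bytes,
which matches the figure quoted in the commit message.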


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
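
A minimal sketch of the varargs problem described above, assuming a hypothetical variadic helper `count_ptrs()` that is not part of tcp_subr.c. A variadic callee reads a full pointer-sized argument, so the terminator has to be passed as a pointer; the explicit `(void *)` cast guarantees that even where NULL is defined as a plain integer 0.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: counts pointer arguments up to a null
 * pointer terminator. va_arg() fetches a pointer-width value,
 * so a bare int 0 terminator may be read incorrectly on LP64.
 */
static int
count_ptrs(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}

/* The (void *) cast forces a pointer-width terminator. */
static int
count_two(void)
{
	return count_ptrs("a", "b", (void *)NULL);
}
```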


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
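
A small sketch of the overflow concern above, under stated assumptions: the TCP header carries the window in a 16-bit field, and the window scale option shifts it by at most 14 bits. The helper names `pick_wscale()` and `scaled_window()` are hypothetical and do not come from tcp_subr.c.

```c
#include <stdint.h>

#define MAX_WINSHIFT	14	/* largest window scale shift */
#define MAX_WIN		65535	/* largest value of the 16-bit window field */

/* Pick the smallest shift that lets bufsize fit into 16 bits. */
static unsigned int
pick_wscale(uint32_t bufsize)
{
	unsigned int shift = 0;

	while (shift < MAX_WINSHIFT && (bufsize >> shift) > MAX_WIN)
		shift++;
	return shift;
}

/* Scale a window and clamp it so the header field cannot overflow. */
static uint16_t
scaled_window(uint32_t win, unsigned int shift)
{
	uint32_t scaled = win >> shift;

	if (scaled > MAX_WIN)
		scaled = MAX_WIN;
	return (uint16_t)scaled;
}
```

Without the clamp, a window that still exceeds 65535 after scaling would be silently truncated when stored in the 16-bit field.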


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.178 08-Jul-2021 bluhm

The xformsw array never changes. Declare struct xformsw constant
and map data read only.
OK deraadt@ mvs@ mpi@


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
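
The ordering problem above can be sketched as follows. This is an illustration only: `struct conn`, `recompute_mss()` and `mss_shrank()` are hypothetical stand-ins for the tcpcb, tcp_mss() and the check in tcp_mtudisc(). The point is that the old value must be saved before the recomputation, and a resend is only worthwhile when the segment size became smaller.

```c
/* Hypothetical connection state; not the kernel's struct tcpcb. */
struct conn {
	unsigned int maxseg;	/* current maximum segment size */
};

/* Stand-in for tcp_mss(): lower maxseg to fit the new path MTU. */
static void
recompute_mss(struct conn *c, unsigned int mtu, unsigned int hdrlen)
{
	unsigned int mss = mtu > hdrlen ? mtu - hdrlen : 0;

	if (mss < c->maxseg)
		c->maxseg = mss;
}

/* Returns 1 if the segment size shrank, meaning a resend makes sense. */
static int
mss_shrank(struct conn *c, unsigned int mtu, unsigned int hdrlen)
{
	unsigned int orig = c->maxseg;	/* save before recomputation */

	recompute_mss(c, mtu, hdrlen);
	return orig > c->maxseg;	/* "greater than", not "not equal" */
}
```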


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timer.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of
per-ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
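
A minimal sketch of the resize rule described above: double the table once the entry count reaches 75% of the table size. `struct table_stats`, `needs_resize()` and `next_size()` are hypothetical names for illustration and are not the kernel's inpcb table code.

```c
#include <stddef.h>

/* Hypothetical bookkeeping; not the kernel's struct inpcbtable. */
struct table_stats {
	size_t size;	/* number of hash buckets */
	size_t count;	/* number of entries stored */
};

/* True once the entry count reaches 75% of the table size. */
static int
needs_resize(const struct table_stats *t)
{
	/* count / size >= 3 / 4, written without division */
	return t->count * 4 >= t->size * 3;
}

/* Double the bucket count when the 75% threshold is reached. */
static size_t
next_size(const struct table_stats *t)
{
	return needs_resize(t) ? t->size * 2 : t->size;
}
```

Using integer multiplication instead of a floating-point ratio keeps the check exact, which matters in kernel code where floating point is generally avoided.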


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
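
For reference, the two initial window formulas can be sketched as below. The formulas follow RFC 3390 and the initcwnd draft (later published as RFC 6928); the function names are hypothetical and the kernel's actual macros may be written differently.

```c
#include <stdint.h>

static uint32_t
u32_min(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static uint32_t
u32_max(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* RFC 3390: min(4*MSS, max(2*MSS, 4380 bytes)) */
static uint32_t
iw_rfc3390(uint32_t mss)
{
	return u32_min(4 * mss, u32_max(2 * mss, 4380));
}

/* draft-ietf-tcpm-initcwnd: min(10*MSS, max(2*MSS, 14600 bytes)) */
static uint32_t
iw_initcwnd(uint32_t mss)
{
	return u32_min(10 * mss, u32_max(2 * mss, 14600));
}
```

With a typical Ethernet MSS of 1460 bytes, the larger formula yields exactly 14600 bytes, which is where the figure in the commit message comes from.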


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers
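
The RFC 1948 idea mentioned above is ISN = M + F(connection tuple, secret), where M is a clock and F is a keyed hash. The sketch below illustrates only that structure. All names are hypothetical, and FNV-1a is swapped in purely to keep the example self-contained: it is NOT cryptographically secure, and a real implementation would use a cryptographic hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IPv4 connection tuple, for illustration only. */
struct conn_tuple {
	uint32_t laddr, faddr;	/* local and foreign address */
	uint16_t lport, fport;	/* local and foreign port */
};

/* FNV-1a stand-in for the keyed hash F; not cryptographically secure. */
static uint32_t
fnv1a(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* ISN = clock + F(secret, tuple); fields are hashed one by one to skip padding. */
uint32_t
isn_rfc1948_sketch(const struct conn_tuple *t, const unsigned char *secret,
    size_t secret_len, uint32_t clock)
{
	uint32_t h = 2166136261u;

	h = fnv1a(h, secret, secret_len);
	h = fnv1a(h, &t->laddr, sizeof(t->laddr));
	h = fnv1a(h, &t->faddr, sizeof(t->faddr));
	h = fnv1a(h, &t->lport, sizeof(t->lport));
	h = fnv1a(h, &t->fport, sizeof(t->fport));
	return clock + h;
}
```

For a fixed tuple and secret the hash term is constant, so successive ISNs advance exactly with the clock; that is the monotonic property the commit message refers to for TCP clients.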


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
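
To illustrate what "5 bits of fraction" means for a smoothed RTT, here is a minimal fixed-point sketch. It is not the tcp_xmit_timer() code: the names, the 1/8 gain, and the first-sample handling are assumptions chosen for this example.

```c
/* Hypothetical constants for this sketch only. */
#define SRTT_FRAC_BITS	5	/* srtt is stored as ticks * 32 */
#define SRTT_GAIN_DIV	8	/* new = old + (sample - old) / 8 */

/*
 * Update a fixed-point smoothed RTT with a new sample given in ticks.
 * A stored value of 0 means "no estimate yet" in this sketch.
 */
int
srtt_update(int srtt_fp, int rtt_ticks)
{
	int delta;

	if (srtt_fp == 0)
		return rtt_ticks << SRTT_FRAC_BITS;
	delta = (rtt_ticks << SRTT_FRAC_BITS) - srtt_fp;
	/* Division instead of a right shift keeps negative deltas portable. */
	return srtt_fp + delta / SRTT_GAIN_DIV;
}

/* Convert the stored value back to whole ticks: this is the "missing shift". */
int
srtt_ticks(int srtt_fp)
{
	return srtt_fp >> SRTT_FRAC_BITS;
}
```

Using the stored value without shifting it back, as item 3 of the commit describes, would overstate the RTT by a factor of 32 in this representation.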


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs, now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
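
The varargs problem described above can be shown with a small userland example. The function count_strings() is hypothetical and only illustrates the pattern: a variadic function that stops at a null-pointer sentinel, which the caller must pass as a pointer-sized value.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: counts string arguments up to a null-pointer
 * sentinel. The sentinel is read with va_arg() as a pointer, so the
 * caller has to pass a pointer, not a bare integer constant.
 */
int
count_strings(const char *first, ...)
{
	va_list ap;
	const char *s;
	int n = 0;

	va_start(ap, first);
	for (s = first; s != NULL; s = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A call such as count_strings("a", "b", (void *)NULL) is safe everywhere. Where NULL is defined as a plain 0, an uncast sentinel is passed as an int, and reading it back as a 64-bit pointer is undefined; the cast avoids that.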


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
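
The overflow in question comes from the 16-bit window field in the TCP header. A minimal sketch of the clamp, assuming hypothetical names and the maximum shift count of 14 from the window scale option specification:

```c
#include <stdint.h>

#define MAX_TCP_WINDOW	65535u	/* largest value the 16-bit field can hold */
#define MAX_WIN_SHIFT	14u	/* largest window scale shift allowed */

/*
 * Hypothetical helper: compute the value to place in the TCP header
 * for a receive window of win_bytes and a given scale shift, clamped
 * so that it never wraps around in the 16-bit field.
 */
uint16_t
advertised_window(uint32_t win_bytes, unsigned int scale)
{
	uint32_t scaled;

	if (scale > MAX_WIN_SHIFT)
		scale = MAX_WIN_SHIFT;
	scaled = win_bytes >> scale;
	if (scaled > MAX_TCP_WINDOW)
		scaled = MAX_TCP_WINDOW;
	return (uint16_t)scaled;
}
```

Without the clamp, a 1 MB window with a shift of 2 would scale to 262144, which truncates to 0 when stored in 16 bits.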


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.177 30-Jun-2021 bluhm

For path MTU discovery tcp_mtudisc() should resend a TCP packet by
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It makes only sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@


Revision tags: OPENBSD_6_9_BASE
# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroy/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi
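
For reference, the timestamp option that now has to accompany keep-alives is ten bytes long (kind 8, length 10, TSval, TSecr) and is commonly preceded by two NOPs for 32-bit alignment. The sketch below builds that 12-byte layout; the function names are hypothetical and this is not the tcp_respond() code.

```c
#include <stddef.h>
#include <stdint.h>

#define OPT_NOP		1	/* TCP option kind: no-operation */
#define OPT_TIMESTAMP	8	/* TCP option kind: timestamps */
#define OPTLEN_TSTAMP	10	/* kind + length + two 32-bit values */

/* Store a 32-bit value in network byte order. */
static void
put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * Hypothetical helper: write NOP, NOP, TIMESTAMP, len, TSval, TSecr
 * into buf (which must hold at least 12 bytes) and return the number
 * of bytes written.
 */
size_t
put_timestamp_option(unsigned char *buf, uint32_t tsval, uint32_t tsecr)
{
	buf[0] = OPT_NOP;
	buf[1] = OPT_NOP;
	buf[2] = OPT_TIMESTAMP;
	buf[3] = OPTLEN_TSTAMP;
	put_be32(buf + 4, tsval);
	put_be32(buf + 8, tsecr);
	return 12;
}
```

A keep-alive built this way carries a 32-byte TCP header (20 bytes plus 12 bytes of options) instead of the bare 20-byte header.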


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
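
The resize rule above amounts to a load-factor check. A minimal sketch with hypothetical function names, using integer arithmetic to avoid floating point:

```c
#include <stddef.h>

/*
 * Hypothetical helper: true once the entry count reaches 75% of the
 * table size, written as count * 4 >= size * 3.
 */
int
table_needs_resize(size_t count, size_t size)
{
	return count * 4 >= size * 3;
}

/* Double the table when the threshold is reached, otherwise keep it. */
size_t
next_table_size(size_t count, size_t size)
{
	return table_needs_resize(count, size) ? size * 2 : size;
}
```

For a table of 128 buckets the threshold is 96 entries, at which point the size would grow to 256. Rehashing the existing entries into the new table is left out of this sketch.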


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
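
For reference, the two initial window sizes can be computed as below. This is a sketch under assumptions: it follows the formulas published in RFC 3390 and in draft-ietf-tcpm-initcwnd (later RFC 6928), `sketch_initial_window()` is a hypothetical helper, and the kernel's exact handling of net.inet.tcp.rfc3390 (including the value 0) is not modelled.

```c
/* Smaller and larger of two values. */
static unsigned long
sketch_min(unsigned long a, unsigned long b)
{
	return (a < b ? a : b);
}

static unsigned long
sketch_max(unsigned long a, unsigned long b)
{
	return (a > b ? a : b);
}

/*
 * mode 2: min(10*MSS, max(2*MSS, 14600))  (draft-ietf-tcpm-initcwnd)
 * else:   min(4*MSS,  max(2*MSS, 4380))   (RFC 3390)
 */
static unsigned long
sketch_initial_window(unsigned long mss, int mode)
{
	if (mode == 2)
		return (sketch_min(10 * mss, sketch_max(2 * mss, 14600)));
	return (sketch_min(4 * mss, sketch_max(2 * mss, 4380)));
}
```

For an Ethernet-sized MSS of 1460 bytes the newer formula yields 14600 bytes, the figure quoted in the commit message.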


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
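
The problem can be illustrated with a small variadic function. This is a generic sketch, not the kernel code that was changed: `sketch_count_ptrs()` is a hypothetical helper. If NULL is defined as a plain integer 0, an uncast NULL in the variable argument list is passed as an int, and reading it back as a pointer on a 64-bit system is undefined; the explicit cast guarantees a full-width null pointer.

```c
#include <stdarg.h>
#include <stddef.h>

/* Count pointer arguments up to, but not including, a null sentinel. */
static int
sketch_count_ptrs(void *first, ...)
{
	va_list ap;
	void *p;
	int n = 0;

	va_start(ap, first);
	/* Each variadic argument is read back as a pointer. */
	for (p = first; p != NULL; p = va_arg(ap, void *))
		n++;
	va_end(ap);
	return (n);
}
```

A call such as `sketch_count_ptrs("a", "b", (void *)NULL)` terminates the list portably; the same call with an uncast NULL relies on how the platform defines NULL.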


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
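
The overflow can be illustrated with a small clamp. This is a sketch of the general idea only, assuming the usual 16-bit window field; `SKETCH_TCP_MAXWIN`, `sketch_clamp_window()` and `sketch_advertised_window()` are hypothetical names and the exact change made in this revision is not shown in the log.

```c
#include <stdint.h>

/* Largest value that fits in the 16-bit TCP window field. */
#define SKETCH_TCP_MAXWIN 65535u

/* Limit a receive window (in bytes) to what the given scale can express. */
static uint32_t
sketch_clamp_window(uint32_t win, unsigned int scale)
{
	/* scale is at most 14, so the shift fits in 32 bits */
	uint32_t max = SKETCH_TCP_MAXWIN << scale;

	return (win > max ? max : win);
}

/* Value to place in the 16-bit header field after window scaling. */
static uint16_t
sketch_advertised_window(uint32_t win, unsigned int scale)
{
	return ((uint16_t)(sketch_clamp_window(win, scale) >> scale));
}
```

Without the clamp, a 100000-byte window with a scale of 0 would be truncated to 34464 when stored in 16 bits; with it the field carries 65535.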


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.176 25-Feb-2021 dlg

we don't have to cast to caddr_t when calling m_copydata anymore.

the first cut of this diff was made with coccinelle using this spatch:

@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)

i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.

ok deraadt@ bluhm@


Revision tags: OPENBSD_6_8_BASE
# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such a mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.175 24-Jul-2020 cheloha

netinet: tcp_close(): delay reaper timeout by one tick

Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().

bluhm@ says waiting a tick for the reaper shouldn't break anything.

ok bluhm@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of
per-ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
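
The resize rule described above can be sketched as follows. This is an illustration under stated assumptions, not the in_pcb code: both function names are hypothetical, and only the 75% threshold and the doubling step come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true once count entries reach 75% of size. */
static bool
example_table_needs_resize(size_t count, size_t size)
{
	/* Integer form of count / size >= 0.75, avoiding floating point. */
	return count * 4 >= size * 3;
}

/* Hypothetical helper: the table size after one doubling step. */
static size_t
example_table_next_size(size_t size)
{
	return size * 2;
}
```

For a table of 128 buckets the check first becomes true at 96 entries, and the next size is 256.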


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
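
The arithmetic behind the two settings can be sketched as below. This is a simplified illustration, assuming only what the commit message states (10 * MSS for the value 2, 4 * MSS for the value 1); the function name is hypothetical and the real kernel logic may apply additional caps that are not modeled here.

```c
#include <stdint.h>

/*
 * Hypothetical helper: initial window in bytes for a given MSS.
 * mode mirrors the two net.inet.tcp.rfc3390 values named in the
 * commit message; any other value is not modeled and yields 0.
 */
static uint32_t
example_initial_window(uint32_t mss, int mode)
{
	switch (mode) {
	case 2:
		return 10 * mss;	/* new default */
	case 1:
		return 4 * mss;		/* previous default */
	default:
		return 0;		/* not covered by this sketch */
	}
}
```

With an MSS of 1460 bytes the first case gives the 14600 bytes quoted in the message.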


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers
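
The RFC 1948 scheme computes the initial sequence number as a clock value plus a keyed hash of the connection 4-tuple, which is why ISNs stay monotonic per connection. A rough sketch follows, with loud caveats: every name is hypothetical, and the FNV-1a mixer is a non-cryptographic stand-in chosen only to keep the example short. A real implementation uses a cryptographic hash with a secret key.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection identifier (IPv4 only, for brevity). */
struct example_conn {
	uint32_t laddr, faddr;	/* local and foreign address */
	uint16_t lport, fport;	/* local and foreign port */
};

/* FNV-1a step; NOT cryptographic, used here purely as a placeholder. */
static uint32_t
example_mix(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len-- > 0) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/* ISN = clock + F(secret, 4-tuple), all arithmetic modulo 2^32. */
static uint32_t
example_isn(const struct example_conn *c, uint32_t secret, uint32_t clock)
{
	uint32_t h = 2166136261u;

	h = example_mix(h, &secret, sizeof(secret));
	h = example_mix(h, &c->laddr, sizeof(c->laddr));
	h = example_mix(h, &c->faddr, sizeof(c->faddr));
	h = example_mix(h, &c->lport, sizeof(c->lport));
	h = example_mix(h, &c->fport, sizeof(c->fport));
	return clock + h;
}
```

For a fixed 4-tuple and secret, the hash term is constant, so successive ISNs differ only by the elapsed clock value.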


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
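
The varargs issue can be illustrated with a small standalone function. This is a sketch, not the kernel code in question, and example_count_strings is a hypothetical name. Because variadic arguments get no prototype-driven conversion, a bare NULL that expands to the integer 0 may be passed as an int; the explicit (void *) cast guarantees a full-width pointer.

```c
#include <stdarg.h>
#include <stddef.h>

/* Count the string arguments up to the terminating null pointer. */
static int
example_count_strings(const char *first, ...)
{
	va_list ap;
	const char *s;
	int n = 0;

	va_start(ap, first);
	/* Each va_arg() reads a pointer-sized value from the list. */
	for (s = first; s != NULL; s = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A call then looks like example_count_strings("a", "b", (void *)NULL), with the cast on the terminator.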


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
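
The constraint mentioned above is that the window field in the TCP header is 16 bits wide, so the value left after applying the window scale must not exceed 65535. A minimal sketch of such a clamp, assuming hypothetical names and a scale below 32 (TCP itself limits the shift to 14):

```c
#include <stdint.h>

#define EXAMPLE_MAXWIN 65535u	/* largest value a 16-bit field holds */

/*
 * Hypothetical helper: value to place in the 16-bit window field
 * for a receive window of win bytes and a window scale of scale.
 */
static uint16_t
example_window_field(uint32_t win, unsigned int scale)
{
	uint32_t scaled = win >> scale;

	/* Clamp instead of letting the value wrap around. */
	if (scaled > EXAMPLE_MAXWIN)
		scaled = EXAMPLE_MAXWIN;
	return (uint16_t)scaled;
}
```

Without the clamp, a 70000-byte window with scale 0 would be truncated to 4464 when stored in 16 bits.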


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.174 04-Oct-2018 bluhm

Revert the inpcb table mutex commit. It triggers a witness panic
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroy/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
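
The resize rule described in revision 1.131 can be sketched roughly as
below. This is an illustration only, not the in_pcb.c code: the struct
and the function names (toy_table, toy_needs_resize, toy_grow) are
hypothetical, and a real table would also rehash its entries after
growing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCB hash table (not struct inpcbtable). */
struct toy_table {
	size_t size;	/* number of hash buckets */
	size_t count;	/* number of entries currently stored */
};

/* Returns 1 once the entry count reaches 75% of the table size. */
static int
toy_needs_resize(const struct toy_table *t)
{
	/* count >= 0.75 * size, written without floating point */
	return (t->count * 4 >= t->size * 3);
}

/* Doubles the bucket count when the 75% threshold has been reached. */
static void
toy_grow(struct toy_table *t)
{
	assert(t->size > 0);
	if (toy_needs_resize(t))
		t->size *= 2;	/* rehashing of entries is omitted here */
}
```

With 128 buckets, the threshold in this sketch is crossed at the 96th
entry, after which the table would hold 256 buckets.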


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
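
For reference, the two initial-window bounds mentioned in revision 1.113
can be sketched as follows. The formulas are the ones given in RFC 3390
and in draft-ietf-tcpm-initcwnd (later RFC 6928); the function names are
hypothetical, and the log does not say whether tcp_subr.c applies exactly
these caps, so treat this as a sketch of the RFC arithmetic rather than
of the kernel code.

```c
#include <assert.h>

/* Smaller and larger of two unsigned values. */
static unsigned int
umin(unsigned int a, unsigned int b)
{
	return (a < b ? a : b);
}

static unsigned int
umax(unsigned int a, unsigned int b)
{
	return (a > b ? a : b);
}

/* RFC 3390 bound: min(4*MSS, max(2*MSS, 4380 bytes)). */
static unsigned int
iw_rfc3390(unsigned int mss)
{
	assert(mss > 0);
	return (umin(4 * mss, umax(2 * mss, 4380)));
}

/* initcwnd draft bound: min(10*MSS, max(2*MSS, 14600 bytes)). */
static unsigned int
iw_initcwnd10(unsigned int mss)
{
	assert(mss > 0);
	return (umin(10 * mss, umax(2 * mss, 14600)));
}
```

For an MSS of 1460 bytes the larger bound works out to 10 * 1460 = 14600
bytes, which matches the figure quoted in the commit message.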


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
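
The fixed-point layout named in item 4 above (5 fraction bits for the
smoothed RTT, 4 for the variance) can be illustrated with the sketch
below. It uses the textbook gains of 1/8 and 1/4 from RFC 6298, which is
an assumption here; the struct, the macro names, and the functions are
hypothetical and are not taken from tcp_xmit_timer().

```c
#include <assert.h>

#define TOY_SRTT_FRAC_BITS	5	/* fraction bits in srtt, per the log */
#define TOY_RTTVAR_FRAC_BITS	4	/* fraction bits in rttvar, per the log */

/* Hypothetical estimator state; both fields are scaled integers. */
struct toy_rtt {
	int srtt;	/* smoothed RTT in ticks, shifted left by 5 */
	int rttvar;	/* RTT variance in ticks, shifted left by 4 */
};

/* First measurement: srtt = rtt, rttvar = rtt / 2. */
static void
toy_rtt_init(struct toy_rtt *r, int rtt)
{
	assert(rtt >= 0);
	r->srtt = rtt << TOY_SRTT_FRAC_BITS;
	r->rttvar = (rtt << TOY_RTTVAR_FRAC_BITS) / 2;
}

/* Later measurements: srtt moves by 1/8 of the error, rttvar by 1/4. */
static void
toy_rtt_update(struct toy_rtt *r, int rtt)
{
	int delta = (rtt << TOY_SRTT_FRAC_BITS) - r->srtt;

	r->srtt += delta / 8;
	if (delta < 0)
		delta = -delta;
	/* delta carries 5 fraction bits; drop one to match rttvar's 4 */
	r->rttvar += ((delta >> 1) - r->rttvar) / 4;
}

/* Whole ticks of the smoothed RTT: the "missing shift" when reading. */
static int
toy_srtt_ticks(const struct toy_rtt *r)
{
	return (r->srtt >> TOY_SRTT_FRAC_BITS);
}
```

Forgetting the shift when reading either field gives a value that is
32 or 16 times too large, which is the kind of error item 3 refers to.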


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.173 20-Sep-2018 bluhm

As a step towards per inpcb or socket locks, remove the net lock
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
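
The growth rule described above can be illustrated with a small sketch. This is not the committed code: the helper name next_table_size() is hypothetical, and the only facts taken from the log entry are the 75% threshold and the doubling.

```c
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: given the current number
 * of hash entries and the current table size, return the size the table
 * should have.  The table doubles once the entries reach 75% of the size.
 */
static size_t
next_table_size(size_t entries, size_t size)
{
	/* entries >= 0.75 * size, written without floating point */
	if (entries * 4 >= size * 3)
		return size * 2;
	return size;
}
```

With a 128-slot table, the 96th entry is the one that would trigger the resize to 256 slots under this rule.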


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
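
For reference, the two initial-window sizes mentioned above can be sketched from the published formulas. This is a minimal illustration and not the kernel's macro: the function names are hypothetical, and the formulas are those of RFC 3390 and of the initcwnd proposal (later published as RFC 6928), not code read from the tree.

```c
#include <stdint.h>

static uint32_t
u32_min(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static uint32_t
u32_max(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)) */
static uint32_t
iw_rfc3390(uint32_t mss)
{
	return u32_min(4 * mss, u32_max(2 * mss, 4380));
}

/* initcwnd proposal: IW = min(10*MSS, max(2*MSS, 14600 bytes)) */
static uint32_t
iw_initcwnd(uint32_t mss)
{
	return u32_min(10 * mss, u32_max(2 * mss, 14600));
}
```

For the common Ethernet MSS of 1460 bytes the first formula gives 4380 bytes and the second gives 14600 bytes, which is where the "14600 bytes" figure in the log entry comes from.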


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
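
The fixed-point scales named in item 4 can be illustrated with a textbook estimator. This sketch is not tcp_xmit_timer(): it is the standard Jacobson/RFC 6298 update (gains 1/8 and 1/4) written with the 5-bit and 4-bit fractions from the log entry. The struct and function names are hypothetical.

```c
#include <stdint.h>

#define SRTT_FRAC_BITS		5	/* srtt is stored as ticks << 5 */
#define RTTVAR_FRAC_BITS	4	/* rttvar is stored as ticks << 4 */

struct rtt_est {
	int32_t srtt;		/* smoothed RTT, 5 bits of fraction */
	int32_t rttvar;		/* RTT variance, 4 bits of fraction */
};

/* Feed one RTT sample (in whole ticks) into the estimator. */
static void
rtt_update(struct rtt_est *e, int32_t rtt)
{
	int32_t err;

	if (e->srtt == 0) {
		/* first sample: srtt = rtt, rttvar = rtt / 2 */
		e->srtt = rtt << SRTT_FRAC_BITS;
		e->rttvar = (rtt << RTTVAR_FRAC_BITS) / 2;
		return;
	}
	/* error against the old srtt, in srtt units (5 fraction bits) */
	err = (rtt << SRTT_FRAC_BITS) - e->srtt;
	e->srtt += err / 8;			/* gain 1/8 */
	if (err < 0)
		err = -err;
	/* drop one fraction bit to match rttvar's scale, then gain 1/4 */
	e->rttvar += ((err >> 1) - e->rttvar) / 4;
}

/* Retransmit timeout in whole ticks: srtt + 4 * rttvar. */
static int32_t
rtt_rto(const struct rtt_est *e)
{
	return (e->srtt >> SRTT_FRAC_BITS) +
	    ((4 * e->rttvar) >> RTTVAR_FRAC_BITS);
}
```

The point of the missing-shift fixes in item 3 is visible here: every time the two values are combined, each has to be brought to a common scale first.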


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
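
The overflow described above can be illustrated with a minimal sketch. It assumes the usual BSD approach of clamping the receive window to the largest value the 16-bit header field can carry after scaling. The names `clamp_recv_window`, `window_field`, and the `SKETCH_` constants are hypothetical and are not taken from tcp_subr.c.

```c
#include <assert.h>

#define SKETCH_TCP_MAXWIN	65535L	/* largest value of the 16-bit window field */
#define SKETCH_MAX_WINSHIFT	14	/* largest window scale shift (RFC 1323) */

/*
 * Clamp a receive window (in bytes) so that, after shifting right by
 * the window scale, it still fits in the 16-bit TCP window field.
 */
static long
clamp_recv_window(long win, int rcv_scale)
{
	long limit;

	assert(rcv_scale >= 0 && rcv_scale <= SKETCH_MAX_WINSHIFT);
	limit = SKETCH_TCP_MAXWIN << rcv_scale;
	if (win < 0)
		win = 0;
	if (win > limit)
		win = limit;
	return win;
}

/* Value that would be written into the TCP header window field. */
static unsigned short
window_field(long win, int rcv_scale)
{
	return (unsigned short)(clamp_recv_window(win, rcv_scale) >> rcv_scale);
}
```

Without the clamp, a 70000-byte window with scale 0 would be truncated to 4464 when stored in 16 bits; with it, the field holds 65535.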


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.172 14-Jun-2018 yasuoka

Use mbuf (not cluster) always for t_template of tcpcb.

ok bluhm


# 1.171 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Input from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
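
The resize rule above can be sketched as a small load-factor check. This is an illustration under stated assumptions, not the code from in_pcb.c: the function name `next_table_size` is hypothetical, and the 75% threshold is expressed in integer arithmetic to avoid floating point.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the table size to use for a table currently holding `count`
 * entries in `size` buckets: doubled once the entry count reaches
 * 75% of the table size, unchanged otherwise.
 */
static size_t
next_table_size(size_t count, size_t size)
{
	assert(size > 0);
	/* count >= 0.75 * size, written without floating point */
	if (count * 4 >= size * 3)
		return size * 2;
	return size;
}
```

For a 128-bucket table the threshold is 96 entries, so the 96th entry would trigger growth to 256 buckets.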


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
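
As a rough illustration of the two initial-window settings mentioned above, the sketch below uses the formulas published in RFC 3390 and in draft-ietf-tcpm-initcwnd. Whether the kernel computes the window exactly this way is an assumption. The function name `initial_window` and the `mode` parameter (standing in for the net.inet.tcp.rfc3390 value) are hypothetical.

```c
#include <assert.h>

static unsigned int
umin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int
umax(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * Initial congestion window in bytes for a given MSS.
 *   mode 1: RFC 3390, min(4*MSS, max(2*MSS, 4380))
 *   mode 2: draft-ietf-tcpm-initcwnd, min(10*MSS, max(2*MSS, 14600))
 *   other:  assumed fallback of a single segment
 */
static unsigned int
initial_window(unsigned int mss, int mode)
{
	assert(mss > 0);
	switch (mode) {
	case 1:
		return umin(4 * mss, umax(2 * mss, 4380));
	case 2:
		return umin(10 * mss, umax(2 * mss, 14600));
	default:
		return mss;
	}
}
```

With a typical Ethernet MSS of 1460 bytes, mode 2 yields 14600 bytes (10 segments) and mode 1 yields 4380 bytes (3 segments). For small segments such as 536 bytes, the limits become 10*MSS and 4*MSS respectively.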


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers
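
The RFC 1948 scheme referred to above computes ISN = M + F(local address, local port, remote address, remote port, secret), where M is a clock and F is a keyed hash. The sketch below shows only that structure. It is not the kernel code: FNV-1a is used purely as a stand-in for the cryptographic hash and must not be used for real sequence numbers, and `struct conn_id`, `mix`, and `sketch_isn` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Connection 4-tuple (IPv4 only, for brevity). */
struct conn_id {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* FNV-1a update step: a NON-cryptographic placeholder for the real hash. */
static uint32_t
mix(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len-- > 0) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/* ISN = clock + F(secret, 4-tuple); arithmetic wraps modulo 2^32. */
static uint32_t
sketch_isn(const struct conn_id *id, const unsigned char *secret,
    size_t secretlen, uint32_t clock)
{
	uint32_t h = 2166136261u;	/* FNV offset basis */

	assert(id != NULL && secret != NULL);
	h = mix(h, secret, secretlen);
	h = mix(h, &id->laddr, sizeof(id->laddr));
	h = mix(h, &id->faddr, sizeof(id->faddr));
	h = mix(h, &id->lport, sizeof(id->lport));
	h = mix(h, &id->fport, sizeof(id->fport));
	return clock + h;
}
```

Because the hash term is constant for a given 4-tuple and secret, successive ISNs for the same connection pair advance with the clock (modulo 2^32), which is what makes them monotonic per client, while the offset differs unpredictably between tuples when a real keyed hash is used.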


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
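
The clamp described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code from this revision: the name sketch_window_field and the SKETCH_ constants are hypothetical. The only facts relied on are that the TCP window field is 16 bits wide and that the window scale shift is at most 14.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TCP_MAXWIN	65535u	/* largest value of the 16-bit field */
#define SKETCH_TCP_MAX_WINSHIFT	14	/* largest permitted scale shift */

/*
 * Hypothetical helper: return the value to place in the 16-bit window
 * field for a receive window of `win` bytes and a shift of `scale`.
 */
static uint16_t
sketch_window_field(uint32_t win, unsigned int scale)
{
	uint32_t limit;

	assert(scale <= SKETCH_TCP_MAX_WINSHIFT);

	/* Largest window that still fits in 16 bits after scaling. */
	limit = (uint32_t)SKETCH_TCP_MAXWIN << scale;
	if (win > limit)
		win = limit;

	return (uint16_t)(win >> scale);
}
```

Without the clamp, a window larger than 65535 << scale would be truncated when stored in the header and advertise a much smaller window than intended.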


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.170 02-Apr-2018 dhill

Use memcpy on freshly allocated memory and add the free size.

OK millert@


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
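
The growth rule in this entry (double the table once it is 75% full) can be sketched as follows. This is a minimal illustration, not the in_pcb code: sketch_next_table_size is a hypothetical name, and the power-of-two table size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the kernel's function: return the table size
 * to use once `count` entries are stored in a table of `size` buckets.
 */
static size_t
sketch_next_table_size(size_t count, size_t size)
{
	/* Assumption: table sizes are non-zero powers of two. */
	assert(size != 0 && (size & (size - 1)) == 0);

	/* count / size >= 3/4, written without division. */
	if (count * 4 >= size * 3)
		return size * 2;

	return size;
}
```

Doubling keeps the size a power of two, so a bucket index can still be computed with a mask, and it bounds the average chain length at the cost of rehashing every entry on each resize.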


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
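
For reference, the two initial-window sizes mentioned here can be computed as below. This is a sketch under assumptions: the formulas are the ones given in RFC 3390 and in the initcwnd draft (later published as RFC 6928), and the kernel code in this revision may differ in detail. The function name and the `mode` parameter, which mirrors the sysctl values 1 and 2, are hypothetical.

```c
#include <assert.h>

static unsigned int
sketch_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int
sketch_max(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical helper: initial congestion window in bytes.
 * mode 1: RFC 3390,      min(4*MSS,  max(2*MSS, 4380))
 * mode 2: initcwnd draft, min(10*MSS, max(2*MSS, 14600))
 */
static unsigned int
sketch_initial_window(unsigned int mss, int mode)
{
	assert(mode == 1 || mode == 2);

	if (mode == 2)
		return sketch_min(10 * mss, sketch_max(2 * mss, 14600));

	return sketch_min(4 * mss, sketch_max(2 * mss, 4380));
}
```

With an MSS of 1460 bytes, mode 2 yields 14600 bytes (ten segments), which is where the "10 * MSS or 14600 bytes" wording comes from.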


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
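
The varargs pitfall described in this entry can be illustrated with a small, self-contained example. It is a sketch only: sketch_count_args is a hypothetical function, not one from the kernel. The point is that a variadic callee reads each argument as a pointer, so the caller has to pass a pointer-sized sentinel, which the explicit (void *) cast guarantees even where NULL is defined as a plain integer 0.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the string arguments that precede the
 * terminating null pointer.
 */
static int
sketch_count_args(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	/* Each variadic argument is fetched as a full-width pointer. */
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);

	return n;
}
```

A call such as sketch_count_args("a", "b", (void *)NULL) is well defined on every ABI. Passing a bare 0 as the sentinel may push only an int, leaving the upper bits that va_arg reads undefined on a 64-bit system.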


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


Revision tags: OPENBSD_6_3_BASE
# 1.169 18-Mar-2018 bluhm

Refactor tcp_mtudisc() like NetBSD did. Do the route lookup only
if the tcpcb exists.
OK mpi@


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
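
The resize rule in revision 1.131 can be sketched in a few lines of C. This is
only an illustration of the 75% threshold described in the log message; the
function names and the integer arithmetic are assumptions made for this sketch
and are not taken from in_pcb.c.

```c
#include <stddef.h>

/*
 * Hypothetical helper: report whether a hash table holding `count`
 * entries in `size` buckets has reached a 75% load factor.
 * Integer form of count >= 0.75 * size.
 */
static int
table_should_grow(size_t count, size_t size)
{
	return count * 4 >= size * 3;
}

/* Hypothetical helper: the table size is doubled once the threshold is hit. */
static size_t
table_next_size(size_t count, size_t size)
{
	return table_should_grow(count, size) ? size * 2 : size;
}
```

With 128 buckets, for example, the table would stay at 128 buckets with 95
entries and grow to 256 once the 96th entry is present.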


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
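
The two window sizes mentioned in revision 1.113 can be sketched as follows.
The formulas are the ones given in RFC 3390 (4*MSS, bounded by 4380 bytes) and
in draft-ietf-tcpm-initcwnd (10*MSS, bounded by 14600 bytes); that the kernel
computes exactly this is an assumption, and the function and parameter names
are hypothetical.

```c
/* Hypothetical helpers for this sketch. */
static unsigned long
ul_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long
ul_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Initial window in bytes for a given MSS.
 * mode 1: RFC 3390,                 min(4*MSS,  max(2*MSS, 4380))
 * mode 2: draft-ietf-tcpm-initcwnd, min(10*MSS, max(2*MSS, 14600))
 * Other modes are outside the scope of this sketch and return 0.
 */
static unsigned long
initial_window(unsigned long mss, int mode)
{
	switch (mode) {
	case 1:
		return ul_min(4 * mss, ul_max(2 * mss, 4380));
	case 2:
		return ul_min(10 * mss, ul_max(2 * mss, 14600));
	default:
		return 0;
	}
}
```

For an MSS of 1460 bytes this gives 14600 bytes in mode 2 and 4380 bytes in
mode 1, which matches the "10 * MSS or 14600 bytes" wording of the log entry.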


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
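
The problem fixed in revision 1.65 can be illustrated with a small variadic
function. This is a generic C sketch, not code from tcp_subr.c, and
count_ptr_args() is a hypothetical name. When NULL is defined as a plain
integer 0, passing it through "..." hands the callee an int rather than a
pointer, so on a 64-bit system the upper bits that va_arg() reads as part of
the pointer are not guaranteed to be zero. Casting the sentinel to (void *)
forces a full-width null pointer.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the string arguments that precede a
 * null-pointer sentinel. The sentinel must be passed as (void *)NULL
 * so that a full pointer-sized value is placed in the argument list.
 */
static int
count_ptr_args(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A caller would write count_ptr_args("a", "b", (void *)NULL) rather than
count_ptr_args("a", "b", NULL).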


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seem to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
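
A minimal C sketch of the constraint named above, not the committed
code. The TCP header carries the window in a 16-bit field, so the
largest window that can be advertised is 65535 shifted left by the
window scale. The names MAX_WIN, clamp_window() and window_field() are
hypothetical; the maximum shift of 14 is the limit set by the window
scale option.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WIN 65535u	/* largest value of the 16-bit window field */

/* Clamp a receive window so that it stays representable after scaling. */
static uint32_t
clamp_window(uint32_t win, unsigned int scale)
{
	uint32_t limit;

	assert(scale <= 14);	/* maximum window scale shift */
	limit = (uint32_t)MAX_WIN << scale;
	return win > limit ? limit : win;
}

/* Value that would be written into the 16-bit header field. */
static uint16_t
window_field(uint32_t win, unsigned int scale)
{
	return (uint16_t)(clamp_window(win, scale) >> scale);
}
```

With the clamp in place, a 70000 byte window at scale 0 is advertised
as 65535 instead of wrapping around to a small value.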


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.168 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
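
A minimal C sketch of the growth rule stated above, assuming "reaches
75%" means the entry count is greater than or equal to three quarters
of the table size. The helper name next_table_size() is hypothetical,
and rehashing the existing entries into the larger table is left out.

```c
#include <assert.h>

/*
 * Hypothetical policy helper: given the current table size and the
 * number of entries, return the size the table should have.
 */
static unsigned int
next_table_size(unsigned int size, unsigned int count)
{
	assert(size > 0);
	/* count >= 75% of size, written without floating point */
	if ((unsigned long long)count * 4 >= (unsigned long long)size * 3)
		return size * 2;
	return size;
}
```

For a table of 128 buckets the size stays at 128 up to 95 entries and
doubles to 256 once the 96th entry is present.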


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
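
A minimal C sketch of the two initial window rules referred to above,
written from the formulas in RFC 3390 and in draft-ietf-tcpm-initcwnd
(later published as RFC 6928), not copied from the kernel. The function
names are hypothetical; mss is the maximum segment size in bytes.

```c
#include <assert.h>

static unsigned int
umin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int
umax(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)) */
static unsigned int
iw_rfc3390(unsigned int mss)
{
	assert(mss > 0);
	return umin(4 * mss, umax(2 * mss, 4380));
}

/* initcwnd draft: IW = min(10*MSS, max(2*MSS, 14600 bytes)) */
static unsigned int
iw_initcwnd(unsigned int mss)
{
	assert(mss > 0);
	return umin(10 * mss, umax(2 * mss, 14600));
}
```

For the common Ethernet MSS of 1460 bytes the older rule yields 4380
bytes and the newer rule yields 14600 bytes, which is the figure quoted
in the commit message.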


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to an
alternate routing table and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
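
A minimal C sketch of wraparound-safe sequence number comparison, the
idea behind helpers such as SEQ_MIN/SEQ_MAX mentioned above. The
function names seq_lt(), seq_min() and seq_max() are hypothetical
stand-ins and are not claimed to match the kernel macros. The sketch
assumes two's complement conversion from uint32_t to int32_t.

```c
#include <assert.h>
#include <stdint.h>

/* True if a comes before b in 32-bit sequence space. */
static int
seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Earlier of two sequence numbers, taking wraparound into account. */
static uint32_t
seq_min(uint32_t a, uint32_t b)
{
	return seq_lt(a, b) ? a : b;
}

/* Later of two sequence numbers, taking wraparound into account. */
static uint32_t
seq_max(uint32_t a, uint32_t b)
{
	return seq_lt(a, b) ? b : a;
}

static void
seq_example(void)
{
	/* 0xfffffff0 precedes 0x10 once the counter has wrapped. */
	assert(seq_lt(0xfffffff0u, 0x10u));
	assert(seq_min(0xfffffff0u, 0x10u) == 0xfffffff0u);
}
```

A plain unsigned comparison would pick 0x10 as the minimum here, which
is wrong for a connection whose sequence numbers have just wrapped.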


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
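
A minimal sketch of the varargs issue described above, assuming a
hypothetical helper count_ptr_args() that is not part of tcp_subr.c.
A bare NULL may be passed through "..." as a plain int on some
platforms, so the terminator is cast to a pointer type:

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper (illustration only): counts the string
 * arguments that precede a terminating null pointer.
 */
int
count_ptr_args(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	/* The sketch requires at least one real argument. */
	assert(first != NULL);

	va_start(ap, first);
	/* Each va_arg() fetches a full-width pointer from the list. */
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A caller would write count_ptr_args("a", "b", (void *)NULL); the cast
makes sure a pointer-sized null is pushed even where int is narrower
than a pointer.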


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seems to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
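
A minimal sketch of the kind of clamp this entry refers to, not the
code from the commit. The function name advertised_window() and both
constants are assumptions made for illustration; the idea is that the
value placed in the 16-bit window field must not exceed 65535 after
the window scale shift has been applied:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAXWIN	65535u	/* largest value of the 16-bit window field */
#define SKETCH_MAXSHIFT	14	/* largest window scale shift */

/*
 * Hypothetical helper: returns the value to put into the TCP header
 * window field for a receive window of `win` bytes and a window
 * scale shift of `scale`.
 */
uint16_t
advertised_window(uint32_t win, unsigned int scale)
{
	uint32_t scaled;

	assert(scale <= SKETCH_MAXSHIFT);

	scaled = win >> scale;
	/* Clamp instead of letting the cast truncate the high bits. */
	if (scaled > SKETCH_MAXWIN)
		scaled = SKETCH_MAXWIN;
	return (uint16_t)scaled;
}
```

Without the clamp, a scaled value of 65536 would wrap to 0 when stored
in the 16-bit field and advertise a closed window.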


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.167 07-Dec-2017 mikeb

Initialize tcp_secret in tcp_init

The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.

OK deraadt, visa, mpi


# 1.166 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.165 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.164 18-May-2017 mpi

Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in
<netinet/tcp_debug.h>.

The IPv6 variant was always included and the IPv4 version is not
present on all systems.

Most of the offending ports are already fixed, thanks to sthen@!


# 1.163 09-May-2017 bluhm

Convert diagnostic panic to compile time assert in tcp6_ctlinput().
No binary change.
OK mpi@


# 1.162 04-May-2017 bluhm

Introduce sstosa() for converting sockaddr_storage with a type safe
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@


# 1.161 19-Apr-2017 bluhm

Use the rt_rmx defines that hide the struct rt_kmetrics indirection.
No binary change.
OK mpi@


Revision tags: OPENBSD_6_1_BASE
# 1.160 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.159 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.158 10-Jan-2017 mpi

Remove NULL checks before m_free(9), it deals with it.

ok bluhm@, kettenis@


# 1.157 20-Dec-2016 mpi

No need for splsoftnet()/splx() dance around a pool_put() if the pool
has IPL_SOFTNET as ipl.

ok mikeb@, kettenis@


# 1.156 24-Sep-2016 naddy

ANSIfy netinet/; from David Hill


# 1.155 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.154 06-Sep-2016 dlg

pool_setipl for various netinet and netinet6 bits

thank you to everyone who helped review these diffs

ok mpi@


# 1.153 03-Sep-2016 bluhm

Reduce the factor of the limits derived from NMBCLUSTERS. We want
the additional clusters in the socket buffer and not elsewhere.
OK claudio@


# 1.152 31-Aug-2016 mpi

Use 'sc_route{4,6}' directly instead of casting them to 'struct route *'.

This is another little step towards deprecating 'struct route{,_in6}'.

ok florian@


Revision tags: OPENBSD_6_0_BASE
# 1.151 07-Mar-2016 naddy

Sync no-argument function declaration and definition by adding (void).
ok mpi@ millert@


Revision tags: OPENBSD_5_9_BASE
# 1.150 24-Oct-2015 mpi

Ignore Router Advertisement's current hop limit.

Apart from the usual inet6 axe murdering exercise to keep you fit, this
allows us to get rid of a lot of layer violation due to the use of per-
ifp variables to store the current hop limit.

Inputs from bluhm@, ok phessler@, florian@, bluhm@


# 1.149 02-Oct-2015 tedu

add a comment above the rfc1948 code that mentions the rfc so it's easy to find


# 1.148 11-Sep-2015 claudio

Kill yet another argument to functions in IPv6. This time ip6_output's
ifpp - XXX: just for statistics
ifpp is always NULL in all callers so that statistic confirms ifpp is
dying
OK mpi@


# 1.147 01-Sep-2015 bluhm

Replace sockaddr casts with the proper satosin(), ... calls.
From David Hill; OK mpi@; tested kspillner@; tweaks bluhm@


# 1.146 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.145 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_8_BASE
# 1.144 16-Jul-2015 mpi

Expand ancient NTOHL/NTOHS/HTONS/HTONL macros.

ok guenther@, henning@


# 1.143 16-Jun-2015 mpi

Store a unique ID, an interface index, rather than a pointer to the
receiving interface in the packet header of every mbuf.

The interface pointer should now be retrieved when necessary with
if_get(). If a NULL pointer is returned by if_get(), the interface
has probably been destroyed/removed and the mbuf should be freed.

Such mechanism will simplify garbage collection of mbufs and limit
problems with dangling ifp pointers.

Tested by jmatthew@ and krw@, discussed with many.

ok mikeb@, bluhm@, dlg@


# 1.142 13-May-2015 jsg

test mbuf pointers against NULL not 0
ok krw@ miod@


# 1.141 07-May-2015 mikeb

Include the timestamp TCP option in keep alive packets as well.

According to RFC 7323 "once TSopt has been successfully negotiated,
... [it] MUST be sent in every non-<RST> segment for the duration
of the connection." Which means that keep alives which are just
ACK packets must include that too.

Pointed out and tested by Lauri Tirkkonen <lotheac at iki ! fi>, thanks!
ok mpi


# 1.140 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.139 19-Dec-2014 tedu

unifdef INET in net code as a precursor to removing the pretend option.
long live the one true internet.
ok henning mikeb


# 1.138 18-Nov-2014 tedu

move arc4random prototype to systm.h. more appropriate for most code
to include that than rndvar.h. ok deraadt dlg


# 1.137 16-Nov-2014 tedu

remove now unnecessary casts from hash update calls.


# 1.136 06-Nov-2014 mpi

Let's just call a rdomain a rdomain.

ok dlg@


# 1.135 06-Nov-2014 dlg

mix the rtable into the hash for tcp sequence number generation.

ok tedu@ claudio@


# 1.134 04-Nov-2014 mpi

Remove "pl" suffix on pool names.

ok dlg@, uebayasi@, mikeb@


# 1.133 20-Oct-2014 tedu

use sha512 instead of md5 for tcp isn. ok deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.132 22-Jul-2014 mpi

Fewer <netinet/in_systm.h> !


# 1.131 12-Jul-2014 yasuoka

Resize the pcb hashtable automatically. The table size will be doubled
when the number of the hash entries reaches 75% of the table size.

ok dlg henning, 'commit in' claudio
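
The resize rule above can be sketched as follows. This is an
illustration under stated assumptions, not the in_pcb code: the name
next_table_size() is hypothetical, and "reaches 75%" is read here as
"greater than or equal to three quarters of the table size":

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the current number of entries and the
 * current table size, returns the table size that should be used.
 */
size_t
next_table_size(size_t count, size_t size)
{
	assert(size > 0);

	/* count / size >= 3/4, written without floating point. */
	if (count * 4 >= size * 3)
		return size * 2;
	return size;
}
```

With a 128-slot table the sketch keeps the size at 95 entries and
doubles it to 256 once the 96th entry is present.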


# 1.130 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.129 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.128 21-Apr-2014 henning

we'll do fine without casting NULL to struct foo * / void *
ok gcc & md5 (alas, no binary change)


# 1.127 18-Apr-2014 henning

tcp_respond: let the stack worry about the cksum instead of doing it
manually, ok naddy (in january)


# 1.126 14-Apr-2014 mpi

"struct pkthdr" holds a routing table ID, not a routing domain one.
Avoid the confusion by using an appropriate name for the variable.

Note that since routing domain IDs are a subset of the set of routing
table IDs, the following idiom is correct:

rtableid = rdomain

But to get the routing domain ID corresponding to a given routing table
ID, you must call rtable_l2(9).

claudio@ likes it, ok mikeb@


Revision tags: OPENBSD_5_5_BASE
# 1.125 24-Oct-2013 mpi

Reduce the number of in6_var.h inclusions by moving some functions and
global variables to in6.h.

ok deraadt@


# 1.124 23-Oct-2013 mpi

Reduce the number of in_var.h inclusions by moving some functions and
global variables to in.h.

ok mikeb@, deraadt@


# 1.123 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.122 20-Oct-2013 phessler

Put a large chunk of the IPv6 rdomain support in-tree.

Still some important missing pieces, and this is not yet enabled.

OK bluhm@


# 1.121 19-Oct-2013 henning

make in_proto_cksum_out not rely on the pseudo header checksum to be
already there, just compute it - it's dirt cheap. since that happens
very late in ip_output, the rest of the stack doesn't have to care about
checksums at all any more, if something needs to be checksummed, just
set the flag on the pkthdr mbuf to indicate so.
stop pre-computing the pseudo header checksum and incrementally updating it
in the tcp and udp stacks.
ok lteo florian


Revision tags: OPENBSD_5_4_BASE
# 1.120 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.119 31-May-2013 bluhm

The function rip6_ctlinput() claims that sa6_src is constant to
allow the assignment of &sa6_any. But rip6_ctlinput() could not
guarantee that as it casted away the const attribute when it passes
the pointer to in6_pcbnotify(). Replace sockaddr with const
sockaddr_in6 in the in6_pcbnotify() parameters. This reduces the
number of casts. Also adjust in6_pcbhashlookup() to handle the
const attribute correctly.
Input and OK claudio@


# 1.118 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


# 1.117 02-Apr-2013 bluhm

Use macros sotoinpcb() and intotcpcb() instead of casts. Use NULL
instead of 0 for pointers. No binary change.
OK mpi@


# 1.116 28-Mar-2013 tedu

code that calls timeout functions should include timeout.h
slipped by on i386, but the zaurus doesn't automagically pick it up.
spotted by patrick


# 1.115 28-Mar-2013 tedu

no need for a lot of code to include proc.h


Revision tags: OPENBSD_5_3_BASE
# 1.114 28-Dec-2012 gsoares

change the malloc(9) flags from M_DONTWAIT to M_NOWAIT; OK millert@


Revision tags: OPENBSD_5_2_BASE
# 1.113 10-Mar-2012 claudio

Increase TCP's initial window to 10 * MSS or 14600 bytes as proposed in
draft-ietf-tcpm-initcwnd. net.inet.tcp.rfc3390 defaults to 2 now which
uses the 10*MSS, setting it back to 1 brings back the old default of 4*MSS.
OK sperreault@, henning@, sthen@, markus@
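
A rough sketch of the two initial window sizes mentioned above. It is
not taken from the kernel source: the function name initial_window()
is hypothetical, and the upper-bound formulas are assumed to be the
ones given in RFC 3390 and in the initcwnd draft (later RFC 6928).
The `mode` parameter mirrors the sysctl values named in the log
(1 for the old 4*MSS behaviour, 2 for 10*MSS):

```c
#include <assert.h>

static unsigned int
sketch_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int
sketch_max(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical helper: upper bound on the initial congestion window
 * in bytes for a given MSS.
 */
unsigned int
initial_window(unsigned int mss, int mode)
{
	assert(mss > 0);
	assert(mode == 1 || mode == 2);

	if (mode == 2)	/* 10 * MSS, bounded by 14600 bytes */
		return sketch_min(10 * mss, sketch_max(2 * mss, 14600));
	/* 4 * MSS, bounded by 4380 bytes (RFC 3390) */
	return sketch_min(4 * mss, sketch_max(2 * mss, 4380));
}
```

For a typical Ethernet MSS of 1460 bytes the sketch yields 14600 bytes
in mode 2 and 4380 bytes in mode 1.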


Revision tags: OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.112 11-Jan-2011 deraadt

for key material that is being discarded, convert bzero() to
explicit_bzero() where required
ok markus mikeb


Revision tags: OPENBSD_4_8_BASE
# 1.111 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.110 15-Jan-2010 chl

Replace pool_get() + bzero() with pool_get(..., PR_ZERO).

With input from oga@ and krw@

ok oga@ krw@ thib@ markus@ mk@


# 1.109 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.108 03-Nov-2009 claudio

rtables are stacked on rdomains (it is possible to have multiple routing
tables on top of a rdomain) but until now our code was a crazy mix so that
it was impossible to correctly use rtables in that case. Additionally pf(4)
only knows about rtables and not about rdomains. This is especially bad when
tracking (possibly conflicting) states in various domains.
This diff fixes all or most of these issues. It adds a lookup function to
get the rdomain id based on a rtable id. Makes pf understand rdomains and
allows pf to move packets between rdomains (it is similar to NAT).
Because pf states now track the rdomain id as well it is necessary to modify
the pfsync wire format. So old and new systems will not sync up.
A lot of help by dlg@, tested by sthen@, jsg@ and probably more
OK dlg@, mpf@, deraadt@


# 1.107 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.106 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.105 09-Jun-2008 djm

rename arc4random_bytes => arc4random_buf to match libc's nicer name;
ok deraadt@


# 1.104 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.103 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.102 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.101 27-Nov-2007 deraadt

TCP_COMPAT_42 was last used in 1997. Kill it.
ok millert


# 1.100 18-Sep-2007 djm

arc4random_bytes() is the preferred interface for generating nonces;
"looks ok" markus@


# 1.99 01-Sep-2007 henning

since the
MGET* macros were changed to function calls, there wasn't any
need for the pool declarations and the inclusion of pool.h
From: tbert <bret.lambert@gmail.com>


Revision tags: OPENBSD_4_2_BASE
# 1.98 25-Jun-2007 markus

branches: 1.98.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.97 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


# 1.96 01-Jun-2007 henning

apply the "skip ipsec if there are no flows" speedup diff to IPv6 too.
we need a pointer to the inpcb to decide, which was not previously
passed to ip6_output, so this diff is a little bigger.
from itojun, ok ryan


# 1.95 09-May-2007 deraadt

tcp_iss usage is ifdef TCP_COMPAT_42, so the variable decl can be too


# 1.94 08-May-2007 deraadt

variables used by #ifdef code should be inside #ifdef too


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.93 04-Mar-2006 brad

branches: 1.93.4;
With the exception of two other small uncommitted diffs this moves
the remainder of the network stack from splimp to splnet.

ok miod@


Revision tags: OPENBSD_3_9_BASE
# 1.92 28-Sep-2005 brad

Enable RFC3390 by default and remove a few compile time options which
can be changed via sysctl's.

ok markus@


Revision tags: OPENBSD_3_8_BASE
# 1.91 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.90 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.89 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


Revision tags: OPENBSD_3_7_BASE
# 1.88 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.87 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.86 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.85 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.84 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


Revision tags: OPENBSD_3_6_BASE
# 1.83 10-Aug-2004 markus

branches: 1.83.2;
verify th_seq in icmp errors; report Fernando Gont; ok mcbride@, dhartmei@


# 1.82 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs, now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.81 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.80 07-May-2004 millert

Replace RSA-derived md5 code with code derived from Colin Plumb's PD version.
This moves md5.c out of libkern and into sys/crypto where it belongs (as
requested by markus@). Note that md5.c is still mandatory (dev/rnd.c uses it).
Verified with IPsec + hmac-md5 and tcp md5sig. OK henning@ and hshoexer@


# 1.79 04-May-2004 claudio

The tcp specific routing metrics are almost never used so reduce the routing
table from these metrics. struct rt_msghdr used by the routing socket is not
affected and so most userland apps don't need to be changed.
some man page polishing by jmc@
OK henning@ markus@ theo@


# 1.78 26-Apr-2004 frantzen

- allow the user to force the TCP mss below the fail-safe 216 with a low
interface MTU.
- break a tcp_output() -> tcp_mtudisc() -> tcp_output() infinite recursion
when the TCP mss ends up larger than the interface MTU (when the if_mtu is
smaller than the tcp header). connections will still stall
feedback from itojun@, claudio@ and provos and testing from beck@


Revision tags: OPENBSD_3_5_BASE
# 1.77 02-Mar-2004 markus

branches: 1.77.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.76 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.75 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.74 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun
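
For reference, RFC 3390 sets the upper bound on the initial window to
min(4*MSS, max(2*MSS, 4380 bytes)). A minimal sketch of that formula
follows; the function name is hypothetical and this is not the code
from this revision:

```c
#include <stdint.h>

/*
 * Upper bound on the initial congestion window per RFC 3390:
 *   min(4 * MSS, max(2 * MSS, 4380 bytes))
 * The function name is illustrative only.
 */
static uint32_t
rfc3390_initial_window(uint32_t mss)
{
	uint32_t lower = (2 * mss > 4380) ? 2 * mss : 4380;
	uint32_t upper = 4 * mss;

	return (upper < lower) ? upper : lower;
}
```

With a typical Ethernet MSS of 1460 bytes this gives 4380 bytes, i.e.
three full segments.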


# 1.73 09-Jan-2004 markus

don't restrict tcp signature keys to ascii; ok mcbride


# 1.72 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


# 1.71 10-Dec-2003 itojun

de-register. deraadt ok


# 1.70 04-Nov-2003 markus

add in(6)_pcblookup_listen() and replace all calls to in_pcblookup()
with either in(6)_pcbhashlookup() or in(6)_pcblookup_listen();
in_pcblookup is now only used by bind(2); speeds up pcb lookup for
listening sockets; from Claudio Jeker


# 1.69 01-Oct-2003 itojun

use random number generator to generate IPv6 fragment ID/flowlabel.
cleanup IPv6 flowlabel handling. deraadt ok


Revision tags: OPENBSD_3_4_BASE
# 1.68 09-Jul-2003 itojun

branches: 1.68.2;
do not flip ip_len/ip_off in netinet stack. deraadt ok.
(please test, especially PF portion)


# 1.67 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.66 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.65 28-Aug-2002 pefo

branches: 1.65.4;
Fix a problem where passing NULL as a pointer with varargs does not promote
NULL to full 64 bits on a 64 bit address system. Solution is to add a
(void *) cast before NULL. This makes a 64 bit MIPS kernel work and will
probably help future 64 bit ports as well.

OK from art@
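
The varargs issue described above can be sketched as follows. The
helper count_ptrs() is hypothetical and only illustrates the pattern:
when NULL is defined as a plain integer 0, passing it through "..."
pushes an int, so the sentinel needs an explicit pointer cast.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: counts pointer arguments up to a null sentinel.
 * Callers must terminate the list with (void *)NULL, not a bare NULL,
 * so that a full pointer-sized value is passed through the varargs.
 */
static int
count_ptrs(const char *first, ...)
{
	va_list ap;
	const char *p;
	int n = 0;

	va_start(ap, first);
	for (p = first; p != NULL; p = va_arg(ap, const char *))
		n++;
	va_end(ap);
	return n;
}
```

A call would look like count_ptrs("a", "b", (void *)NULL). The named
first parameter is converted by the prototype, so only the variadic
arguments need the cast.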


# 1.64 09-Jun-2002 itojun

whitespace


# 1.63 07-Jun-2002 itojun

avoid is_ipv6 construct. a step towards IPv4-less kernel


# 1.62 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.61 14-Mar-2002 millert

First round of __P removal in sys


# 1.60 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.59 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.58 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.57 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.56 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.55 15-Jan-2002 provos

allocate sackholes with pool


# 1.54 15-Jan-2002 provos

change tcpcb allocation to pool


# 1.53 14-Jan-2002 provos

use macros to manage tcp timers; based on netbsd


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.52 21-Jul-2001 itojun

branches: 1.52.4;
repair IPv6 TCP. th_sum has to be initialized to 0 on template.
(older code had "th_sum = 0" at the bottom of the function, which was
removed during TCP hardware checksumming change)


# 1.51 18-Jul-2001 marc

zero tcp checksum field before calculating new value.
Fixes problem with bad checksums on keepalives
OK provos@
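
Why the field must be zeroed first can be shown with a generic
Internet checksum (RFC 1071 style). This is a simplified sketch, not
the kernel code: the helper names are hypothetical and the TCP
pseudo-header is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over a byte buffer (illustrative helper). */
static uint16_t
inet_cksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;	/* pad odd byte */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* fold carries */
	return (uint16_t)~sum;
}

/*
 * Store a fresh checksum at byte offset off. The field is zeroed
 * first; otherwise a stale value left in a reused header would be
 * summed into the new checksum.
 */
static void
fill_cksum(uint8_t *seg, size_t len, size_t off)
{
	uint16_t c;

	seg[off] = 0;
	seg[off + 1] = 0;
	c = inet_cksum(seg, len);
	seg[off] = (uint8_t)(c >> 8);
	seg[off + 1] = (uint8_t)(c & 0xff);
}
```

After fill_cksum(), running inet_cksum() over the whole buffer yields
0, which is how a receiver validates the segment.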


# 1.50 03-Jul-2001 angelos

Pointer arithmetic fixes work better when you get the casting right.


# 1.49 26-Jun-2001 aaron

Appease gcc by not using void pointers in arithmetic operations.


# 1.48 25-Jun-2001 angelos

Always defer output TCP checksumming until ip_output() (or hardware,
if it exists). Cuts down on code a bit, and we don't need to look at
the routing entry at TCP. Based on NetBSD. UDP case to follow.


# 1.47 23-Jun-2001 angelos

Add comment on why checksum deferral is not useful in tcp_respond()


# 1.46 08-Jun-2001 angelos

Cut down on include files.


# 1.45 05-Jun-2001 deraadt

repair copyright notices for NRL & cmetz; cmetz


# 1.44 04-Jun-2001 mickey

use faster arc4random() in tcp_rndiss_next; niels ok


# 1.43 31-May-2001 angelos

Match IPSEC output prototypes.


# 1.42 01-May-2001 fgsch

Fix tcp_signature_tdb_input decl; kernel compiles again if TCP_SIGNATURE
option is used. Note that this does not work.


Revision tags: OPENBSD_2_9_BASE
# 1.41 06-Apr-2001 csapuntz

Move offsetof define into sys/param.h


# 1.40 14-Mar-2001 mickey

provide a random start for tcp timestamps; niels@ ok


# 1.39 16-Feb-2001 itojun

pull in new pcb notification code from kame. better handling of scope address.


# 1.38 21-Dec-2000 itojun

correct ipv6 path mtu discovery.


# 1.37 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.36 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.35 13-Oct-2000 itojun

validate mbuf chain length on *_ctlinput. remote node may be able to
transmit a truncated icmp6 packet and panic the system. sync with kame.


# 1.34 10-Oct-2000 provos

verify payload of the icmp need fragment message at the tcp layer. okay itojun@


# 1.33 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.32 20-Sep-2000 provos

correctly calculate mss


# 1.31 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.30 11-Jul-2000 provos

forgot to reset rscale


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 05-Jul-2000 itojun

more cleanup for IPv4 mapped address support. there seems to be some
inconsistency in corner cases (from NRL I believe).
todd (fries) and I have seen panic, with the following call chain:
ip6_input -> tcp_input -> tcp_respond -> ip_input -> bang!

more cleanups should be done, to decrease complexity.
for example, INP_IPV6_MAPPED should be nuked.


# 1.27 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.26 03-Jun-2000 itojun

correctly handle ctlinput messages for IPv6.


Revision tags: OPENBSD_2_7_BASE
# 1.25 21-Mar-2000 angelos

Fix function to comply with prototype. Kind of moot, as tcp signatures
don't work yet anyhow, so there's no point compiling them in.


# 1.24 29-Feb-2000 itojun

ensure tcp window size does not overflow (16bit unsigned after window scale).
FreeBSD PR: 16914
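
The constraint can be sketched as below: the window field in the TCP
header is 16 bits, so the scaled value has to be clamped to 65535, and
the window scale shift is limited to 14. The names are hypothetical
and this is not the change made in this revision.

```c
#include <stdint.h>

#define MAX_WIN		65535u	/* largest value of the 16-bit window field */
#define MAX_WINSHIFT	14	/* largest window scale shift */

/*
 * Illustrative helper: convert a byte count into the value placed in
 * the 16-bit window field, clamping so it cannot overflow.
 */
static uint16_t
advertised_window(uint32_t win, unsigned int scale)
{
	uint32_t field;

	if (scale > MAX_WINSHIFT)
		scale = MAX_WINSHIFT;
	field = win >> scale;
	if (field > MAX_WIN)
		field = MAX_WIN;
	return (uint16_t)field;
}
```

Without the clamp, a 1 MB window with a shift of 4 would give 65536,
which truncates to 0 in a 16-bit field.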


Revision tags: SMP_BASE
# 1.23 29-Dec-1999 mickey

branches: 1.23.2;
fix _input/_output proto changes for tcp_signature; angelos@ ok


# 1.22 21-Dec-1999 provos

enable SACK again


Revision tags: kame_19991208
# 1.21 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


# 1.20 29-Oct-1999 angelos

Get rid of unnecessary third argument in *_output routines of IPsec.


Revision tags: OPENBSD_2_6_BASE
# 1.19 27-Aug-1999 millert

Disable SACK for now, it has problems, deraadt@


# 1.18 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.17 06-Jul-1999 cmetz

Removed bogus ifdef/define lines that resulted from an over-aggressive M-x.


# 1.16 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


# 1.15 02-Jul-1999 cmetz

Significant cleanups in the way TCP is made to handle multiple network
protocols.

"struct tcpiphdr" is now gone from much of the code, as are separate pointers
for ti and ti6. The result is fewer variables, which is generally a good thing.

Simple if(is_ipv6) ... else ... tests are gone in favor of a
switch(protocol family), which allows future new protocols to be added easily.
This also makes it possible for someone so inclined to re-implement TUBA (TCP
over CLNP?) and do it right instead of the kluged way it was done in 4.4.

The TCP header template is now referenced through a mbuf rather than done
through a data pointer and dtom()ed as needed. This is partly because dtom() is
evil and partly because max_linkhdr + IPv6 + TCP + MSS/TS/SACK opts won't fit
inside a packet header mbuf, so we need to grab a cluster for that (which the
code now does, if needed).


Revision tags: OPENBSD_2_5_BASE
# 1.14 17-Feb-1999 deraadt

inet6 indent


# 1.13 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.12 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.11 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.10 18-May-1998 provos

first step to the setsockopt/getsockopt interface as described in
draft-mcdonald-simple-ipsec-api, kernel notifies (EMT_REQUESTSA) signal
userland key management applications when security services are requested.
this is only for outgoing connections at the moment, incoming packets
are not yet checked against the selected socket policy.


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.9 26-Aug-1997 deraadt

indent


Revision tags: OPENBSD_2_1_BASE
# 1.8 05-Feb-1997 deraadt

use arc4random()


Revision tags: OPENBSD_2_0_BASE
# 1.7 29-Jul-1996 niklas

Remove random() prototype, as it's not needed. Besides it was wrong for the alpha :-)


# 1.6 29-Jul-1996 tholo

Make TCP ISS increment by random amounts


# 1.5 15-May-1996 mickey

remove unnecessary "XXX it should be sysctl()'ed"


# 1.4 15-May-1996 mickey

fix NetBSD PR#854.
allow to overwrite rfc1323 option in config file.


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision