History log of /openbsd-current/sys/netinet/ip_var.h
Revision Date Author Comments
# 1.119 02-Jul-2024 bluhm

Read IPsec forwarding information once.

Fix MP race between reading ip_forwarding in ip_input() and checking
ip_forwarding == 2 in ip_output(). In theory ip_forwarding could
be 2 during ip_input() and later 0 in ip_output(). Then a packet
would be forwarded that was never allowed. Currently exclusive
netlock in sysctl(2) prevents all races.

Introduce IP_FORWARDING_IPSEC and pass it with the flags parameter
that was introduced for IP_FORWARDING.

Instead of calling m_tag_find(), traversing the list, and comparing
with NULL, just check the PACKET_TAG_IPSEC_IN_DONE bit. Reading
ipsec_in_use in ip_output() is a performance hack that is not
necessary. New code only checks three bits.

OK mvs@
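
The difference between searching the tag list and testing a summary bit
can be sketched as follows. This is a minimal userland illustration, not
the kernel code: all sk_ names, the struct layout, and the bit values are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag type bits, one bit per tag type. */
#define SK_TAG_IPSEC_IN_DONE	0x0001
#define SK_TAG_OTHER		0x0002

struct sk_tag {
	struct sk_tag	*next;
	uint16_t	 type;
};

struct sk_pkt {
	struct sk_tag	*tags;		/* linked list of attached tags */
	uint16_t	 tagsset;	/* OR of the types of all attached tags */
};

/* Attach a tag and record its type in the summary bitmask. */
static void
sk_tag_prepend(struct sk_pkt *p, struct sk_tag *t)
{
	assert(p != NULL && t != NULL);
	t->next = p->tags;
	p->tags = t;
	p->tagsset |= t->type;
}

/* Slow path: walk the list and compare every entry. */
static struct sk_tag *
sk_tag_find(struct sk_pkt *p, uint16_t type)
{
	struct sk_tag *t;

	for (t = p->tags; t != NULL; t = t->next)
		if (t->type == type)
			return t;
	return NULL;
}

/* Fast path: one bit test answers "is such a tag attached?". */
static int
sk_tag_present(const struct sk_pkt *p, uint16_t type)
{
	return (p->tagsset & type) != 0;
}
```

When only the presence of a tag matters and not its payload, the bit
test gives the same answer as the list walk without touching the list.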


# 1.118 07-Jun-2024 bluhm

Read IP forwarding variables only once.

Do not assume that ip_forwarding and ip_directedbcast cannot change
while processing one packet. Read it once and pass down its value
with a flag. This is necessary for unlocking the sysctl path.
There are a few places where a consistent value does not really
matter, they are unchanged. Use a proper ip_ prefix for the global
variable.

OK claudio@
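
The "read once, pass down as a flag" pattern described above might look
roughly like this. It is a hedged userland sketch using C11 atomics; the
sk_ names and flag values are hypothetical and do not reflect the actual
kernel definitions.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits; the real kernel values may differ. */
#define SK_IP_FORWARDING	0x1
#define SK_IP_DIRECTEDBCAST	0x2

/* Stand-ins for globals that a sysctl could change at any time. */
static _Atomic int sk_ip_forwarding;
static _Atomic int sk_ip_directedbcast;

/* Read each global exactly once and fold the result into flags. */
static int
sk_snapshot_flags(void)
{
	int flags = 0;

	if (atomic_load(&sk_ip_forwarding) != 0)
		flags |= SK_IP_FORWARDING;
	if (atomic_load(&sk_ip_directedbcast) != 0)
		flags |= SK_IP_DIRECTEDBCAST;
	return flags;
}

/* Later stages consult only the flags, never the globals. */
static int
sk_may_forward(int flags)
{
	return (flags & SK_IP_FORWARDING) != 0;
}

/* The decision for one packet stays stable if the global changes. */
static void
sk_forward_demo(void)
{
	int flags;

	atomic_store(&sk_ip_forwarding, 1);
	flags = sk_snapshot_flags();		/* input stage */
	atomic_store(&sk_ip_forwarding, 0);	/* concurrent sysctl */
	assert(sk_may_forward(flags));		/* output stage agrees */
}
```

Because every stage sees the same snapshot, a packet cannot be accepted
under one setting and emitted under another.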


# 1.117 17-Apr-2024 bluhm

Use struct ipsec_level within inpcb.

Instead of passing around u_char[4], introduce struct ipsec_level
that contains 4 ipsec levels. This provides better type safety.
The embedding struct inpcb is globally visible for netstat(1), so
put struct ipsec_level outside of #ifdef _KERNEL.

OK deraadt@ mvs@


# 1.116 16-Apr-2024 bluhm

Use route cache function in IP input.

Instead of passing a struct rtentry from ip_input() to ip_forward()
and then embed it into a struct route for ip_output(), start with
struct route and pass it along. Then the route cache is used
consistently. Also the route cache hit and missed counters should
reflect reality after this commit.

There is a small difference in the code. in_ouraddr() checks for
NULL and not rtisvalid(). Previous discussion showed that the route
RTF_UP flag should only be considered for multipath routing.
Otherwise it does not mean anything. Especially the local and
broadcast check in in_ouraddr() should not be affected by interface
link status.

When doing cache lookups, route must be valid, but after rtalloc_mpath()
lookup, use any route that route_mpath() returns.

OK claudio@


# 1.115 14-Apr-2024 bluhm

Run raw IP input in parallel.

Running raw IPv4 input with shared net lock in parallel is less
complex than UDP. Especially there is no socket splicing.

New ip_deliver() may run with shared or exclusive net lock. The
last parameter indicates the mode. If it is running with shared
netlock and encounters a protocol that needs exclusive lock, the
packet is queued. Old ip_ours() always queued the packet. Now it
calls ip_deliver() with shared net lock, and if that cannot handle
the packet completely, the packet is queued and later processed
with exclusive net lock.

In case of an IPv6 header chain, that switches from shared to
exclusive processing, the next protocol and mbuf offset are stored
in a mbuf tag.

OK mvs@


Revision tags: OPENBSD_7_5_BASE
# 1.114 05-Mar-2024 bluhm

Validate IPv4 packet options in divert output.

When sending raw packets over divert socket, IP options were not
validated. Fragment code tries to copy them and crashes. Raw IP
output has a similar feature, but uses rip_chkhdr() to prevent
invalid packets from userland. Call this function also from
divert_output() for strict user input validation.

Reported-by: syzbot+b1ba3a2a8ef13e5b4698@syzkaller.appspotmail.com
OK dlg@ deraadt@ mvs@


# 1.113 13-Feb-2024 bluhm

Merge struct route and struct route_in6.

Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embedded there.

OK claudio@


# 1.112 05-Feb-2024 bluhm

Add netstat counter for route cache.

To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.

OK mvs@


# 1.111 03-Feb-2024 mvs

Rework socket buffers locking for shared netlock.

Shared netlock is not sufficient to call so{r,w}wakeup(). The following
sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we
can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup()
also calls pgsigio() which grabs kernel lock.

However, `so*_filtops' callbacks only perform read-only access to the
socket stuff, so it is enough to hold shared netlock only, but the klist
stuff needs to be protected.

This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time
`sb_mtx' is used to protect only `sb_flags' and `sb_klist'.

Now we have soassertlocked_readonly() and soassertlocked(). The first
one is happy if only shared netlock is held, meanwhile the second wants
`so_lock' or pru_lock() be held together with shared netlock.

To keep soassertlocked*() assertions soft, we need to know mutex(9)
state, so new mtx_owned() macro was introduced. Also, the new optional
(*pru_locked)() handler brings the state of pru_lock().

Tests and ok from bluhm.


# 1.110 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.109 05-Apr-2023 bluhm

ARP has a sysctl to show the number of packets waiting for an arp
response. Implement analog sysctl net.inet6.icmp6.nd6_queued for
ND6 to reduce places where mbufs can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of total
hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@


Revision tags: OPENBSD_7_3_BASE
# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest in pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it isn't called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
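
A wrapper that performs the NULL check and returns EOPNOTSUPP could be
sketched like this. The struct layout and all sk_ names are assumptions
for illustration only; the kernel's real pr_usrreqs and socket structures
carry many more members.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sk_socket;

/* Hypothetical handler table; unsupported requests stay NULL. */
struct sk_usrreqs {
	int	(*pru_bind)(struct sk_socket *, const void *);
};

struct sk_socket {
	const struct sk_usrreqs	*so_usrreqs;
};

/* Wrapper: the missing-handler check lives here, not in every caller. */
static int
sk_pru_bind(struct sk_socket *so, const void *addr)
{
	assert(so != NULL && so->so_usrreqs != NULL);
	if (so->so_usrreqs->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*so->so_usrreqs->pru_bind)(so, addr);
}

/* A protocol that supports bind supplies a handler ... */
static int
sk_dummy_bind(struct sk_socket *so, const void *addr)
{
	(void)so;
	(void)addr;
	return 0;
}

static const struct sk_usrreqs sk_with_bind = { .pru_bind = sk_dummy_bind };

/* ... and one that does not simply leaves the slot NULL. */
static const struct sk_usrreqs sk_without_bind = { .pru_bind = NULL };
```

Callers always go through sk_pru_bind(), so a protocol without bind
support needs no stub function of its own.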


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
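
The truncation can be illustrated with a small sketch. It assumes the
more fragments bit has the value 0x2000 as seen in the 16 bit offset
field on a big endian machine; on little endian the network order mask
is 0x0020 and survives an 8 bit store, which is why the bug was limited
to big endian. The sk_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* "More fragments" bit as it appears on a big endian machine. */
#define SK_IP_MF	0x2000

/* Buggy variant: storing the masked value in 8 bits drops bit 13. */
static int
sk_more_frags_narrow(uint16_t off)
{
	uint8_t mff = (uint8_t)(off & SK_IP_MF);

	return mff != 0;
}

/* Fixed variant: a 16 bit variable keeps the significant bit. */
static int
sk_more_frags_wide(uint16_t off)
{
	uint16_t mff = off & SK_IP_MF;

	return mff != 0;
}

/* Alternative: reduce to 0 or 1 before storing, as "!= 0" did. */
static int
sk_more_frags_bool(uint16_t off)
{
	uint8_t mff = (off & SK_IP_MF) != 0;

	return mff;
}

static void
sk_mf_demo(void)
{
	assert(sk_more_frags_narrow(SK_IP_MF) == 0);	/* flag lost */
	assert(sk_more_frags_wide(SK_IP_MF) == 1);	/* flag kept */
	assert(sk_more_frags_bool(SK_IP_MF) == 1);	/* flag kept */
}
```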


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
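
The enum-indexed per-CPU counter scheme can be sketched in userland as
below. This is only an illustration under assumptions: a fixed CPU count,
an explicit cpu argument, and sk_ names that are all hypothetical. The
kernel's counters_inc() works on real per-CPU memory instead.

```c
#include <assert.h>
#include <stdint.h>

#define SK_NCPU	4	/* hypothetical fixed CPU count */

/* One enum value per former struct member. */
enum sk_ipstat_counter {
	sk_ips_total,
	sk_ips_badsum,
	sk_ips_forward,
	sk_ips_ncounters
};

/* One row per CPU; each CPU updates only its own row. */
static uint64_t sk_ipstat[SK_NCPU][sk_ips_ncounters];

/* Replaces the old "ipstat.ips_foo++" style of update. */
static void
sk_ipstat_inc(unsigned int cpu, enum sk_ipstat_counter c)
{
	assert(cpu < SK_NCPU && c < sk_ips_ncounters);
	sk_ipstat[cpu][c]++;
}

/* Legacy-style struct still handed to userland. */
struct sk_ipstat_export {
	uint64_t	ips_total;
	uint64_t	ips_badsum;
	uint64_t	ips_forward;
};

/* Sum all per-CPU rows into the exported struct. */
static void
sk_ipstat_read(struct sk_ipstat_export *out)
{
	uint64_t sum[sk_ips_ncounters] = { 0 };
	unsigned int cpu, c;

	for (cpu = 0; cpu < SK_NCPU; cpu++)
		for (c = 0; c < sk_ips_ncounters; c++)
			sum[c] += sk_ipstat[cpu][c];

	out->ips_total = sum[sk_ips_total];
	out->ips_badsum = sum[sk_ips_badsum];
	out->ips_forward = sum[sk_ips_forward];
}
```

Updates stay cheap because no CPU writes another CPU's row; the cost of
summing is paid only when the statistics are read.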


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame
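
The check described in 1.15 can be sketched roughly as follows. This is
an illustration only, not the code from ip_input(): the function names
and the rcvif_is_loopback parameter are hypothetical, and addresses are
assumed to be in host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a host-byte-order IPv4 address lies in 127.0.0.0/8. */
static bool
in_loopback_net(uint32_t addr)
{
	return (addr >> 24) == 127;
}

/*
 * Hypothetical input-path check: drop the packet when either header
 * address is a loopback address but the packet did not arrive on a
 * loopback interface, i.e. it came from outside.
 */
static bool
drop_bad_loopback(uint32_t src, uint32_t dst, bool rcvif_is_loopback)
{
	if (rcvif_is_loopback)
		return false;
	return in_loopback_net(src) || in_loopback_net(dst);
}
```

A real implementation would also bump the ips_badaddr counter before
freeing the mbuf; that part is left out of the sketch.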


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.
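
A small sketch of why the signedness matters in 1.4, under the
assumption that the length is held in a 16-bit field. The helper names
and IP_MAXPACKET_SKETCH are made up for this example; they are not
taken from the kernel sources.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Largest total length a 16-bit IP length field can describe. */
#define IP_MAXPACKET_SKETCH 65535

/*
 * Reinterpret the 16 length bits as a signed value, the way a signed
 * field would read them. Lengths above 32767 come out negative.
 */
static int16_t
len_as_signed(uint16_t len)
{
	int16_t s;

	memcpy(&s, &len, sizeof(s));
	return s;
}

/* Refuse datagrams whose total length does not fit the 16-bit field. */
static bool
len_fits(uint32_t total_len)
{
	return total_len <= IP_MAXPACKET_SKETCH;
}
```

With a signed field, a 40000-byte datagram would appear to have a
negative length, which is the kind of misreading an unsigned type
avoids.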


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
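
The idea in 1.2 can be illustrated with a minimal sketch. The struct and
function names below are invented for the example and do not match the
real ipqent definitions; the point is only that the list linkage lives
in a separate element, so the header keeps its fixed wire size no matter
how wide pointers are.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an IPv4 header without options: 20 bytes on the wire. */
struct ip_hdr_sketch {
	uint8_t bytes[20];
};

/*
 * Hypothetical queue element: it carries the pointers, so none need to
 * be threaded through the header overlay itself.
 */
struct frag_qent {
	struct frag_qent *next;
	struct ip_hdr_sketch *hdr;
};

/* Link ent in front of head, point it at hdr, return the new head. */
static struct frag_qent *
frag_push(struct frag_qent *head, struct frag_qent *ent,
    struct ip_hdr_sketch *hdr)
{
	ent->hdr = hdr;
	ent->next = head;
	return ent;
}

/* Count the queued elements. */
static size_t
frag_count(const struct frag_qent *head)
{
	size_t n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return n;
}
```

The queue element grows with the pointer size on 64-bit machines, while
the header structure stays at 20 bytes.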


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.118 07-Jun-2024 bluhm

Read IP forwarding variables only once.

Do not assume that ip_forwarding and ip_directedbcast cannot change
while processing one packet. Read it once and pass down its value
with a flag. This is necessary for unlocking the sysctl path.
There are a few places where a consistent value does not really
matter, they are unchanged. Use a proper ip_ prefix for the global
variable.

OK claudio@


# 1.117 17-Apr-2024 bluhm

Use struct ipsec_level within inpcb.

Instead of passing around u_char[4], introduce struct ipsec_level
that contains 4 ipsec levels. This provides better type safety.
The embedding struct inpcb is globally visible for netstat(1), so
put struct ipsec_level outside of #ifdef _KERNEL.

OK deraadt@ mvs@


# 1.116 16-Apr-2024 bluhm

Use route cache function in IP input.

Instaed of passing a struct rtentry from ip_input() to ip_forward()
and then embed it into a struct route for ip_output(), start with
struct route and pass it along. Then the route cache is used
consistently. Also the route cache hit and missed counters should
reflect reality after this commit.

There is a small difference in the code. in_ouraddr() checks for
NULL and not rtisvalid(). Previous discussion showed that the route
RTF_UP flag should only be considered for multipath routing.
Otherwise it does not mean anything. Especially the local and
broadcast check in in_ouraddr() should not be affected by interface
link status.

When doing cache lookups, route must be valid, but after rtalloc_mpath()
lookup, use any route that route_mpath() returns.

OK claudio@


# 1.115 14-Apr-2024 bluhm

Run raw IP input in parallel.

Running raw IPv4 input with shared net lock in parallel is less
complex than UDP. Especially there is no socket splicing.

New ip_deliver() may run with shared or exclusive net lock. The
last parameter indicates the mode. If is is running with shared
netlock and encounters a protocol that needs exclusive lock, the
packet is queued. Old ip_ours() always queued the packet. Now it
calls ip_deliver() with shared net lock, and if that cannot handle
the packet completely, the packet is queued and later processed
with exclusive net lock.

In case of an IPv6 header chain, that switches from shared to
exclusive processing, the next protocol and mbuf offset are stored
in a mbuf tag.

OK mvs@


Revision tags: OPENBSD_7_5_BASE
# 1.114 05-Mar-2024 bluhm

Validate IPv4 packet options in divert output.

When sending raw packets over divert socket, IP options were not
validated. Fragment code tries to copy them and crashes. Raw IP
output has a similar feature, but uses rip_chkhdr() to prevent
invalid packets from userland. Call this funtion also from
divert_output() for strict user input validation.

Reported-by: syzbot+b1ba3a2a8ef13e5b4698@syzkaller.appspotmail.com
OK dlg@ deraadt@ mvs@


# 1.113 13-Feb-2024 bluhm

Merge struct route and struct route_in6.

Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embeded there.

OK claudio@


# 1.112 05-Feb-2024 bluhm

Add netstat counter for route cache.

To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.

OK mvs@


# 1.111 03-Feb-2024 mvs

Rework socket buffers locking for shared netlock.

Shared netlock is not sufficient to call so{r,w}wakeup(). The following
sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we
can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup()
also calls pgsigio() which grabs kernel lock.

However, `so*_filtops' callbacks only perform read-only access to the
socket stuff, so it is enough to hold shared netlock only, but the klist
stuff needs to be protected.

This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time
`sb_mtx' used to protect only `sb_flags' and `sb_klist'.

Now we have soassertlocked_readonly() and soassertlocked(). The first
one is happy if only shared netlock is held, meanwhile the second wants
`so_lock' or pru_lock() be held together with shared netlock.

To keep soassertlocked*() assertions soft, we need to know mutex(9)
state, so new mtx_owned() macro was introduces. Also, the new optional
(*pru_locked)() handler brings the state of pru_lock().

Tests and ok from bluhm.


# 1.110 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.109 05-Apr-2023 bluhm

ARP has a sysctl to show the number of packets waiting for an arp
response. Implement analog sysctl net.inet6.icmp6.nd6_queued for
ND6 to reduce places where mbufs can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of total
hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@


Revision tags: OPENBSD_7_3_BASE
# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it doesn't called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it
required only for tcp(4) and unix(4) sockets, so i should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the =! 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move existing the code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocate them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the gloabl vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP optaions are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitionning to an un-KERNEL_LOCK()ed IP
forwarding path.

Disucssed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registred when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested a least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are are lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henninh@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others commited minutes ago,
so it is seperate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enable, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if if ip_directbcast is unset
then a packet coming in on any interface to any of the systems
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support bind sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently then in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any doamin at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks accross
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privilages to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame
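
The check described in revision 1.15 can be sketched roughly as follows. This
is only an illustration, not the kernel code: the helper names
is_loopback_net() and should_drop() and the from_loopback_if flag are
hypothetical, and the real input path also updates ipstat.ips_badaddr.

```c
#include <stdint.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: return 1 if the address (network byte order)
 * lies inside 127.0.0.0/8, 0 otherwise.
 */
static int
is_loopback_net(uint32_t addr_be)
{
	return ((ntohl(addr_be) >> 24) == 127);
}

/*
 * Hypothetical decision function: a packet that did not arrive on a
 * loopback interface is dropped when either header address is in
 * 127.0.0.0/8.
 */
static int
should_drop(uint32_t src_be, uint32_t dst_be, int from_loopback_if)
{
	if (from_loopback_if)
		return 0;
	return (is_loopback_net(src_be) || is_loopback_net(dst_be));
}
```

Packets that originate on the loopback interface itself pass this check, so
local 127.0.0.1 traffic is unaffected in this sketch.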


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
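
The "queue element" idea from revision 1.2 can be sketched as below. The names
ip_hdr_sketch, frag_qent and frag_count() are hypothetical and chosen only for
illustration; the actual kernel structures differ. The point is that the
header type keeps its fixed wire size, while the list linkage lives in a
separate element whose size may vary with the pointer width.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an option-less IPv4 header: 20 bytes, no pointers inside. */
struct ip_hdr_sketch {
	uint8_t bytes[20];
};

/*
 * Separate queue element: holds the list linkage and points at the
 * header instead of overlaying pointers on top of it.
 */
struct frag_qent {
	struct frag_qent	*next;	/* next fragment in the queue */
	struct ip_hdr_sketch	*hdr;	/* header this element refers to */
};

/* Walk the queue and count its elements. */
static size_t
frag_count(const struct frag_qent *head)
{
	size_t n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return n;
}
```

With this split, sizeof(struct ip_hdr_sketch) stays 20 on both 32-bit and
64-bit machines, which is the property the overlay approach lost.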


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.117 17-Apr-2024 bluhm

Use struct ipsec_level within inpcb.

Instead of passing around u_char[4], introduce struct ipsec_level
that contains 4 ipsec levels. This provides better type safety.
The embedding struct inpcb is globally visible for netstat(1), so
put struct ipsec_level outside of #ifdef _KERNEL.

OK deraadt@ mvs@


# 1.116 16-Apr-2024 bluhm

Use route cache function in IP input.

Instaed of passing a struct rtentry from ip_input() to ip_forward()
and then embed it into a struct route for ip_output(), start with
struct route and pass it along. Then the route cache is used
consistently. Also the route cache hit and missed counters should
reflect reality after this commit.

There is a small difference in the code. in_ouraddr() checks for
NULL and not rtisvalid(). Previous discussion showed that the route
RTF_UP flag should only be considered for multipath routing.
Otherwise it does not mean anything. Especially the local and
broadcast check in in_ouraddr() should not be affected by interface
link status.

When doing cache lookups, route must be valid, but after rtalloc_mpath()
lookup, use any route that route_mpath() returns.

OK claudio@


# 1.115 14-Apr-2024 bluhm

Run raw IP input in parallel.

Running raw IPv4 input with shared net lock in parallel is less
complex than UDP. Especially there is no socket splicing.

New ip_deliver() may run with shared or exclusive net lock. The
last parameter indicates the mode. If is is running with shared
netlock and encounters a protocol that needs exclusive lock, the
packet is queued. Old ip_ours() always queued the packet. Now it
calls ip_deliver() with shared net lock, and if that cannot handle
the packet completely, the packet is queued and later processed
with exclusive net lock.

In case of an IPv6 header chain, that switches from shared to
exclusive processing, the next protocol and mbuf offset are stored
in a mbuf tag.

OK mvs@


Revision tags: OPENBSD_7_5_BASE
# 1.114 05-Mar-2024 bluhm

Validate IPv4 packet options in divert output.

When sending raw packets over divert socket, IP options were not
validated. Fragment code tries to copy them and crashes. Raw IP
output has a similar feature, but uses rip_chkhdr() to prevent
invalid packets from userland. Call this funtion also from
divert_output() for strict user input validation.

Reported-by: syzbot+b1ba3a2a8ef13e5b4698@syzkaller.appspotmail.com
OK dlg@ deraadt@ mvs@


# 1.113 13-Feb-2024 bluhm

Merge struct route and struct route_in6.

Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embeded there.

OK claudio@


# 1.112 05-Feb-2024 bluhm

Add netstat counter for route cache.

To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.

OK mvs@


# 1.111 03-Feb-2024 mvs

Rework socket buffers locking for shared netlock.

Shared netlock is not sufficient to call so{r,w}wakeup(). The following
sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we
can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup()
also calls pgsigio() which grabs kernel lock.

However, `so*_filtops' callbacks only perform read-only access to the
socket stuff, so it is enough to hold shared netlock only, but the klist
stuff needs to be protected.

This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time
`sb_mtx' used to protect only `sb_flags' and `sb_klist'.

Now we have soassertlocked_readonly() and soassertlocked(). The first
one is happy if only shared netlock is held, meanwhile the second wants
`so_lock' or pru_lock() be held together with shared netlock.

To keep soassertlocked*() assertions soft, we need to know mutex(9)
state, so new mtx_owned() macro was introduces. Also, the new optional
(*pru_locked)() handler brings the state of pru_lock().

Tests and ok from bluhm.


# 1.110 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.109 05-Apr-2023 bluhm

ARP has a sysctl to show the number of packets waiting for an arp
response. Implement analog sysctl net.inet6.icmp6.nd6_queued for
ND6 to reduce places where mbufs can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of total
hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@


Revision tags: OPENBSD_7_3_BASE
# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it doesn't called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it
required only for tcp(4) and unix(4) sockets, so i should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the =! 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move existing the code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocate them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the gloabl vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP optaions are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitionning to an un-KERNEL_LOCK()ed IP
forwarding path.

Disucssed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registred when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested a least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are are lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henninh@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others commited minutes ago,
so it is seperate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enable, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if if ip_directbcast is unset
then a packet coming in on any interface to any of the systems
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.116 16-Apr-2024 bluhm

Use route cache function in IP input.

Instead of passing a struct rtentry from ip_input() to ip_forward()
and then embed it into a struct route for ip_output(), start with
struct route and pass it along. Then the route cache is used
consistently. Also the route cache hit and missed counters should
reflect reality after this commit.

There is a small difference in the code. in_ouraddr() checks for
NULL and not rtisvalid(). Previous discussion showed that the route
RTF_UP flag should only be considered for multipath routing.
Otherwise it does not mean anything. Especially the local and
broadcast check in in_ouraddr() should not be affected by interface
link status.

When doing cache lookups, route must be valid, but after rtalloc_mpath()
lookup, use any route that route_mpath() returns.

OK claudio@


# 1.115 14-Apr-2024 bluhm

Run raw IP input in parallel.

Running raw IPv4 input with shared net lock in parallel is less
complex than UDP. Especially there is no socket splicing.

New ip_deliver() may run with shared or exclusive net lock. The
last parameter indicates the mode. If it is running with shared
netlock and encounters a protocol that needs exclusive lock, the
packet is queued. Old ip_ours() always queued the packet. Now it
calls ip_deliver() with shared net lock, and if that cannot handle
the packet completely, the packet is queued and later processed
with exclusive net lock.

In case of an IPv6 header chain, that switches from shared to
exclusive processing, the next protocol and mbuf offset are stored
in a mbuf tag.

OK mvs@


Revision tags: OPENBSD_7_5_BASE
# 1.114 05-Mar-2024 bluhm

Validate IPv4 packet options in divert output.

When sending raw packets over divert socket, IP options were not
validated. Fragment code tries to copy them and crashes. Raw IP
output has a similar feature, but uses rip_chkhdr() to prevent
invalid packets from userland. Call this function also from
divert_output() for strict user input validation.

Reported-by: syzbot+b1ba3a2a8ef13e5b4698@syzkaller.appspotmail.com
OK dlg@ deraadt@ mvs@


# 1.113 13-Feb-2024 bluhm

Merge struct route and struct route_in6.

Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embedded there.

OK claudio@


# 1.112 05-Feb-2024 bluhm

Add netstat counter for route cache.

To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.

OK mvs@


# 1.111 03-Feb-2024 mvs

Rework socket buffers locking for shared netlock.

Shared netlock is not sufficient to call so{r,w}wakeup(). The following
sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we
can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup()
also calls pgsigio() which grabs kernel lock.

However, `so*_filtops' callbacks only perform read-only access to the
socket stuff, so it is enough to hold shared netlock only, but the klist
stuff needs to be protected.

This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time
`sb_mtx' used to protect only `sb_flags' and `sb_klist'.

Now we have soassertlocked_readonly() and soassertlocked(). The first
one is happy if only shared netlock is held, meanwhile the second wants
`so_lock' or pru_lock() be held together with shared netlock.

To keep soassertlocked*() assertions soft, we need to know mutex(9)
state, so new mtx_owned() macro was introduced. Also, the new optional
(*pru_locked)() handler brings the state of pru_lock().

Tests and ok from bluhm.


# 1.110 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.109 05-Apr-2023 bluhm

ARP has a sysctl to show the number of packets waiting for an arp
response. Implement an analogous sysctl net.inet6.icmp6.nd6_queued for
ND6 to reduce places where mbufs can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of total
hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@


Revision tags: OPENBSD_7_3_BASE
# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPUs will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
an m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@
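
The two lookup types described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed code: a flat array stands in for the RB tree, the classful broadcast case is left out, and `struct laddr`, `input_is_local()` and `socket_is_local()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat table standing in for the kernel's RB tree. */
struct laddr {
	uint32_t addr;		/* address, host byte order for simplicity */
	int	 ifidx;		/* interface the address is configured on */
	int	 is_bcast;	/* nonzero for a broadcast address */
};

/* ip_input-style lookup: broadcast entries may count as local. */
int
input_is_local(const struct laddr *tab, size_t n, uint32_t dst,
    int rcvif, int directedbcast)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tab[i].addr != dst)
			continue;
		if (!tab[i].is_bcast)
			return 1;
		/* directedbcast unset: any interface's broadcast is local */
		if (!directedbcast || tab[i].ifidx == rcvif)
			return 1;
	}
	return 0;
}

/* Socket-layer wrapper: same table, but broadcast matches are rejected. */
int
socket_is_local(const struct laddr *tab, size_t n, uint32_t addr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].addr == addr && !tab[i].is_bcast)
			return 1;
	return 0;
}
```

With one unicast and one broadcast entry on interface 1, a broadcast packet arriving on interface 2 is local only while directedbcast is 0, and the socket wrapper never accepts the broadcast entry.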


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in
distinct routing tables. The same network can now be used in any domain at
the same time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts time out and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
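
A rough sketch of the "queue element" idea, assuming a simplified header layout. `struct ip_hdr`, `struct ipq_ent` and `ipq_push()` are hypothetical names, not the structures from this commit:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified 20-byte IPv4 header; its layout must match the wire. */
struct ip_hdr {
	uint8_t	 vhl, tos;
	uint16_t len, id, off;
	uint8_t	 ttl, proto;
	uint16_t sum;
	uint32_t src, dst;
};

/*
 * Hypothetical queue element: the list pointers live here, so their
 * size (4 or 8 bytes) never disturbs the header layout.
 */
struct ipq_ent {
	struct ipq_ent	*next;
	struct ip_hdr	*ip;
};

/* Push a header onto a list without touching the header's bytes. */
void
ipq_push(struct ipq_ent **head, struct ipq_ent *ent, struct ip_hdr *ip)
{
	ent->ip = ip;
	ent->next = *head;
	*head = ent;
}
```

Because the pointers sit in a separate element, the header struct stays the same size on 32-bit and 64-bit machines, which an overlay with embedded pointers cannot guarantee.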


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are are lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb
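
The type checking argument can be illustrated with a small userland
sketch. The names out_va() and out_fixed() are hypothetical stand-ins,
not the kernel's ip_output() prototypes. With the variadic form the
compiler cannot check anything after the first argument; with the fixed
form every argument has a declared type.

```c
#include <stdarg.h>
#include <stdint.h>

/* Old style (hypothetical): arguments after len are unchecked. */
uint32_t
out_va(int len, ...)
{
	va_list ap;
	int flags;
	uint32_t flowinfo;

	va_start(ap, len);
	flags = va_arg(ap, int);
	flowinfo = va_arg(ap, uint32_t);
	va_end(ap);
	return (flags != 0) ? flowinfo : 0;
}

/* New style (hypothetical): the prototype documents and enforces types. */
uint32_t
out_fixed(int len, int flags, uint32_t flowinfo)
{
	(void)len;
	return (flags != 0) ? flowinfo : 0;
}
```

A caller of out_va() has to cast explicitly, e.g.
out_va(10, 1, (uint32_t)7), while out_fixed(10, 1, 7) is converted and
checked by the compiler.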


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@
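
The ip_input rule described above can be sketched in userland as
follows. This is only an illustration under assumptions: a flat array
stands in for the RB tree, the classful broadcast special case is left
out, and struct addr_ent and is_for_us() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: addresses and broadcast addresses share one table. */
struct addr_ent {
	uint32_t addr;
	int is_bcast;
	unsigned int ifidx;	/* interface the address is configured on */
};

/* Return 1 if a packet to dst received on rcvif is treated as local. */
int
is_for_us(const struct addr_ent *tab, size_t n, uint32_t dst,
    unsigned int rcvif, int directedbcast)
{
	size_t i;

	assert(tab != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (tab[i].addr != dst)
			continue;
		if (!tab[i].is_bcast)
			return 1;	/* plain local address */
		/* broadcast: any interface unless directedbcast is set */
		if (!directedbcast || tab[i].ifidx == rcvif)
			return 1;
	}
	return 0;
}
```

The socket layer wrapper mentioned in the last paragraph would, in this
sketch, simply reject entries with is_bcast set.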


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids dont repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
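
A rough userland sketch of the "queue element" idea, under the
assumption of a simplified header; struct pkt_hdr, struct qent and
qent_insert() are hypothetical names, not the historic kernel code.
The header keeps a fixed layout without pointers, and a separate
element carries the list linkage and points at the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-layout header: no pointers, so its size does not
 * depend on the pointer width of the machine. */
struct pkt_hdr {
	uint16_t len;
	uint16_t off;
};

/* Queue element: does the threading and points at the header. */
struct qent {
	struct qent *next;
	struct pkt_hdr *hdr;
};

/* Insert e into a list sorted by hdr->off; return the new head. */
struct qent *
qent_insert(struct qent *head, struct qent *e)
{
	struct qent **pp = &head;

	assert(e != NULL && e->hdr != NULL);
	while (*pp != NULL && (*pp)->hdr->off < e->hdr->off)
		pp = &(*pp)->next;
	e->next = *pp;
	*pp = e;
	return head;
}
```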


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.113 13-Feb-2024 bluhm

Merge struct route and struct route_in6.

Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embedded there.

OK claudio@


# 1.112 05-Feb-2024 bluhm

Add netstat counter for route cache.

To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.

OK mvs@


# 1.111 03-Feb-2024 mvs

Rework socket buffers locking for shared netlock.

Shared netlock is not sufficient to call so{r,w}wakeup(). The following
sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we
can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup()
also calls pgsigio() which grabs kernel lock.

However, `so*_filtops' callbacks only perform read-only access to the
socket stuff, so it is enough to hold shared netlock only, but the klist
stuff needs to be protected.

This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time
`sb_mtx' used to protect only `sb_flags' and `sb_klist'.

Now we have soassertlocked_readonly() and soassertlocked(). The first
one is happy if only shared netlock is held, meanwhile the second wants
`so_lock' or pru_lock() be held together with shared netlock.

To keep soassertlocked*() assertions soft, we need to know mutex(9)
state, so new mtx_owned() macro was introduced. Also, the new optional
(*pru_locked)() handler brings the state of pru_lock().

Tests and ok from bluhm.


# 1.110 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.109 05-Apr-2023 bluhm

ARP has a sysctl to show the number of packets waiting for an arp
response. Implement analog sysctl net.inet6.icmp6.nd6_queued for
ND6 to reduce places where mbufs can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of total
hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@


Revision tags: OPENBSD_7_3_BASE
# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
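
The NULL check in the wrapper could look roughly like this userland
sketch. struct usrreqs, do_bind() and demo_bind() are hypothetical
simplifications; the real handlers take socket and mbuf arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cut-down handler table with a single request. */
struct usrreqs {
	int (*bind)(int sock, int addr);
};

/* Wrapper: a protocol without support leaves the handler NULL. */
int
do_bind(const struct usrreqs *ur, int sock, int addr)
{
	assert(ur != NULL);
	if (ur->bind == NULL)
		return EOPNOTSUPP;
	return ur->bind(sock, addr);
}

/* Example handler for a protocol that supports the request. */
int
demo_bind(int sock, int addr)
{
	(void)sock;
	(void)addr;
	return 0;
}
```

Keeping the check in one wrapper means protocols do not need stub
functions that only return an error.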


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
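
A minimal sketch of the truncation, assuming the flag value 0x2000
(the value of IP_MF) and hypothetical function names. Without a != 0
comparison the masked value itself is stored, and 0x2000 does not fit
into 8 bits. Presumably only big endian machines were affected because
the mask is applied in network byte order, where the byte-swapped flag
on little endian happens to fit; that part is an assumption.

```c
#include <stdint.h>

#define MF_FLAG 0x2000	/* more-fragments bit, same value as IP_MF */

/* Buggy variant: storing the masked value in 8 bits loses the flag. */
int
more_frags_u8(uint16_t off)
{
	uint8_t mff = off & MF_FLAG;

	return mff != 0;
}

/* Fixed variant: a 16 bit variable keeps the significant bit. */
int
more_frags_u16(uint16_t off)
{
	uint16_t mff = off & MF_FLAG;

	return mff != 0;
}
```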


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocate them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
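
The counter scheme can be sketched in userland like this. It is an
illustration only: the explicit cpu argument, NCPU, stat_inc() and
stat_read() are assumptions for the sketch, whereas the kernel's
ipstat_inc() takes just the enum value and goes through counters_inc.

```c
#include <assert.h>
#include <stdint.h>

/* Each counter is identified by an enum value. */
enum ipstat_counters { ips_total, ips_badsum, ips_ncounters };

#define NCPU 4	/* hypothetical CPU count for the sketch */

static uint64_t counters[NCPU][ips_ncounters];

/* Bump one counter in the slot of the given CPU. */
void
stat_inc(int cpu, enum ipstat_counters c)
{
	assert(cpu >= 0 && cpu < NCPU);
	counters[cpu][c]++;
}

/* Sum the per-CPU slots when a total has to be reported. */
uint64_t
stat_read(enum ipstat_counters c)
{
	uint64_t sum = 0;
	int cpu;

	for (cpu = 0; cpu < NCPU; cpu++)
		sum += counters[cpu][c];
	return sum;
}
```

Updates touch only the local slot, and the summing cost is paid on the
rare read path, e.g. when the values are exported to userland.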


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@
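
Why an index is safer than a stored pointer can be shown with a small
userland sketch. The table and the names struct ifnet_demo,
if_lookup(), if_attach_demo() and if_detach_demo() are hypothetical;
the kernel uses its own interface lookup.

```c
#include <stddef.h>

#define MAXIF 8	/* hypothetical table size */

struct ifnet_demo {
	const char *name;
};

static struct ifnet_demo *iftable[MAXIF];

/* Resolve the index at use time; a removed interface yields NULL. */
struct ifnet_demo *
if_lookup(unsigned int idx)
{
	return (idx < MAXIF) ? iftable[idx] : NULL;
}

void
if_attach_demo(unsigned int idx, struct ifnet_demo *ifp)
{
	if (idx < MAXIF)
		iftable[idx] = ifp;
}

/* Clearing the slot invalidates every saved index at once. */
void
if_detach_demo(unsigned int idx)
{
	if (idx < MAXIF)
		iftable[idx] = NULL;
}
```

A saved pointer would dangle after detach, while a saved index simply
fails to resolve and the caller can handle the NULL result.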


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb
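
The type-checking benefit mentioned in 1.50 can be illustrated with a small userland sketch. This is not the kernel's ip_output(); both function names and the single `flags` argument are hypothetical, chosen only to contrast a variadic prototype with a fixed one.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical "before" shape: the flags travel as a variable argument,
 * so the compiler cannot check the type or count at the call site.
 */
static int
out_va_sketch(void *m0, ...)
{
	va_list ap;
	int flags;

	(void)m0;
	va_start(ap, m0);
	flags = va_arg(ap, int);	/* caller must really have passed an int */
	va_end(ap);
	return flags;
}

/*
 * Hypothetical "after" shape: the same argument is a standard parameter,
 * so a wrong type or a missing argument is rejected at compile time.
 */
static int
out_fixed_sketch(void *m0, int flags)
{
	assert(flags >= 0);	/* sketch-only sanity check */
	(void)m0;
	return flags;
}
```

With the fixed prototype, a call such as out_fixed_sketch(NULL) fails to compile, while the variadic form would accept it and read garbage.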


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henninh@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out whether the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
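
The "queue element" approach from 1.2 can be sketched as below. This is an illustration only: the struct and function names are hypothetical, and the real header carries far more than one field. The point is that the list pointers live in a separate element, so the header overlay contains no pointers and keeps the same size on 32-bit and 64-bit machines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an IP header overlay: no pointers inside. */
struct ip_sketch {
	uint16_t ip_off;	/* fragment offset, used as the sort key */
};

/* Hypothetical queue element: does the threading, points at the header. */
struct ipqent_sketch {
	struct ipqent_sketch *next;
	struct ip_sketch *ip;
};

/* Insert qe into the list at *head, keeping it sorted by ip_off. */
static void
ipq_insert_sketch(struct ipqent_sketch **head, struct ipqent_sketch *qe)
{
	struct ipqent_sketch **pp = head;

	assert(qe != NULL && qe->ip != NULL);
	while (*pp != NULL && (*pp)->ip->ip_off < qe->ip->ip_off)
		pp = &(*pp)->next;
	qe->next = *pp;
	*pp = qe;
}
```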


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.112 05-Feb-2024 bluhm

Add netstat counter for route cache.

To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.

OK mvs@


# 1.111 03-Feb-2024 mvs

Rework socket buffers locking for shared netlock.

Shared netlock is not sufficient to call so{r,w}wakeup(). The following
sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we
can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup()
also calls pgsigio() which grabs kernel lock.

However, `so*_filtops' callbacks only perform read-only access to the
socket stuff, so it is enough to hold shared netlock only, but the klist
stuff needs to be protected.

This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time
`sb_mtx' used to protect only `sb_flags' and `sb_klist'.

Now we have soassertlocked_readonly() and soassertlocked(). The first
one is happy if only shared netlock is held, meanwhile the second wants
`so_lock' or pru_lock() be held together with shared netlock.

To keep soassertlocked*() assertions soft, we need to know mutex(9)
state, so new mtx_owned() macro was introduced. Also, the new optional
(*pru_locked)() handler brings the state of pru_lock().

Tests and ok from bluhm.


# 1.110 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.109 05-Apr-2023 bluhm

ARP has a sysctl to show the number of packets waiting for an arp
response. Implement analog sysctl net.inet6.icmp6.nd6_queued for
ND6 to reduce places where mbufs can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of total
hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@


Revision tags: OPENBSD_7_3_BASE
# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
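
The NULL-handler convention described in 1.98 might look roughly like this. It is a userland sketch under stated assumptions: the `_sketch` types and the demo handler are hypothetical, and only the idea (a wrapper that returns EOPNOTSUPP when the handler pointer is NULL) comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket. */
struct socket_sketch {
	int bound;
};

/* Hypothetical per-protocol handler table with an optional bind handler. */
struct pr_usrreqs_sketch {
	int (*pru_bind)(struct socket_sketch *);
};

/* Wrapper: protocols without a bind handler report EOPNOTSUPP. */
static int
pru_bind_sketch(const struct pr_usrreqs_sketch *ur, struct socket_sketch *so)
{
	assert(ur != NULL && so != NULL);
	if (ur->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*ur->pru_bind)(so);
}

/* Hypothetical handler for a protocol that does support bind. */
static int
demo_bind_sketch(struct socket_sketch *so)
{
	so->bound = 1;
	return 0;
}
```

Keeping the check in one wrapper means individual protocols do not need stub handlers that only return an error.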


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
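
One plausible reading of the 1.95 fix, sketched in userland C: the more-fragments flag IP_MF is 0x2000, and masking a network-order offset field leaves that bit in the high byte on big endian hosts (0x2000) but in the low byte on little endian hosts (0x0020). The helper names are hypothetical, and the exact kernel expression is an assumption; the sketch only shows why an 8-bit variable loses the flag without a != 0 test.

```c
#include <assert.h>
#include <stdint.h>

#define IP_MF 0x2000	/* more-fragments flag, as in <netinet/ip.h> */

/* 8-bit storage: keeps only the low byte of the masked value. */
static uint8_t
mff_narrow_sketch(uint16_t masked)
{
	return (uint8_t)masked;
}

/* 16-bit storage: keeps every bit of the masked value. */
static uint16_t
mff_wide_sketch(uint16_t masked)
{
	assert((masked & ~(IP_MF | 0x0020)) == 0);	/* only flag bits expected */
	return masked;
}
```

Passing 0x2000 (the big endian case) through the narrow helper yields 0, so the flag silently disappears; the wide helper preserves it.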


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
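
The counter scheme described in 1.63 can be sketched roughly as follows. This is a userland illustration, not the kernel's per-CPU counter code: the array layout, NCPU_SKETCH, the explicit `cpu` argument and the reader function are all hypothetical. Only the idea of enum-indexed counters updated through an ipstat_inc()-style wrapper comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset; the real enum mirrors the members of struct ipstat. */
enum ipstat_counters_sketch {
	ips_total,
	ips_badsum,
	ips_forward,
	ips_ncounters
};

#define NCPU_SKETCH 4	/* assumption: fixed CPU count for the sketch */

/* One row of counters per CPU, so updates never share a slot. */
static uint64_t ipstat_percpu[NCPU_SKETCH][ips_ncounters];

/* Stand-in for ipstat_inc(): bump only the given CPU's slot. */
static void
ipstat_inc_sketch(unsigned int cpu, enum ipstat_counters_sketch c)
{
	assert(cpu < NCPU_SKETCH);
	ipstat_percpu[cpu][c]++;
}

/* Hypothetical reader: sum every CPU's slot when exporting a total. */
static uint64_t
ipstat_read_sketch(enum ipstat_counters_sketch c)
{
	uint64_t sum = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < NCPU_SKETCH; cpu++)
		sum += ipstat_percpu[cpu][c];
	return sum;
}
```

Summing at read time keeps the hot update path free of shared writes, at the cost of a slightly more expensive export to userland.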


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@
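
Why an index is safer than a stored pointer, as 1.58 argues, can be shown with a small sketch. Everything here is hypothetical (the table, the `_sketch` names, the fixed table size); it is not the kernel's interface lookup code. The idea is that an index is resolved at use time, so a destroyed interface yields NULL instead of a dangling pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MAXIF_SKETCH 8	/* assumption: tiny fixed-size interface table */

/* Hypothetical stand-in for an interface. */
struct ifnet_sketch {
	unsigned int if_index;
};

/* Hypothetical multicast options: remember the index, not the pointer. */
struct ip_moptions_sketch {
	unsigned int imo_ifidx;
};

static struct ifnet_sketch *if_table_sketch[MAXIF_SKETCH];

static void
if_attach_sketch(struct ifnet_sketch *ifp)
{
	assert(ifp->if_index < MAXIF_SKETCH);
	if_table_sketch[ifp->if_index] = ifp;
}

static void
if_detach_sketch(struct ifnet_sketch *ifp)
{
	assert(ifp->if_index < MAXIF_SKETCH);
	if_table_sketch[ifp->if_index] = NULL;
}

/* Resolve an index at use time; returns NULL if the interface is gone. */
static struct ifnet_sketch *
if_get_sketch(unsigned int idx)
{
	return idx < MAXIF_SKETCH ? if_table_sketch[idx] : NULL;
}
```

A caller that gets NULL back can fail the send cleanly, whereas a cached pointer to a removed interface would be dereferenced after free.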


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henninh@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with
a structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.111 03-Feb-2024 mvs

Rework socket buffers locking for shared netlock.

Shared netlock is not sufficient to call so{r,w}wakeup(). The following
sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we
can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup()
also calls pgsigio() which grabs kernel lock.

However, `so*_filtops' callbacks only perform read-only access to the
socket stuff, so it is enough to hold shared netlock only, but the klist
stuff needs to be protected.

This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time
`sb_mtx' used to protect only `sb_flags' and `sb_klist'.

Now we have soassertlocked_readonly() and soassertlocked(). The first
one is happy if only shared netlock is held, meanwhile the second wants
`so_lock' or pru_lock() be held together with shared netlock.

To keep soassertlocked*() assertions soft, we need to know mutex(9)
state, so new mtx_owned() macro was introduced. Also, the new optional
(*pru_locked)() handler brings the state of pru_lock().

Tests and ok from bluhm.


# 1.110 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.109 05-Apr-2023 bluhm

ARP has a sysctl to show the number of packets waiting for an arp
response. Implement analog sysctl net.inet6.icmp6.nd6_queued for
ND6 to reduce places where mbufs can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of total
hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@


Revision tags: OPENBSD_7_3_BASE
# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort(), which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement the (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove the existing
code for all others; it is never called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support the PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with a separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
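
The NULL-handler convention described in this commit can be sketched as
follows. This is only an illustration under stated assumptions: the
struct layouts, the names ending in _sketch and the single-argument
handler signature are hypothetical and do not reproduce the real protosw
or socket definitions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket; only tracks whether bind ran. */
struct socket_sketch {
	int so_bound;
};

/* Hypothetical handler table; a NULL member means "not supported". */
struct pr_usrreqs_sketch {
	int (*pru_bind)(struct socket_sketch *);
};

/* Example handler for a protocol that does support bind. */
int
bind_handler_sketch(struct socket_sketch *so)
{
	so->so_bound = 1;
	return 0;
}

/* Wrapper: do the NULL check once here, not at every call site. */
int
pru_bind_sketch(const struct pr_usrreqs_sketch *ur, struct socket_sketch *so)
{
	if (ur->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*ur->pru_bind)(so);
}

/* Two illustrative protocols: one with a bind handler, one without. */
const struct pr_usrreqs_sketch with_bind_sketch = { bind_handler_sketch };
const struct pr_usrreqs_sketch without_bind_sketch = { NULL };
struct socket_sketch test_socket_sketch = { 0 };
```

Keeping the check in the wrapper means a protocol without the request
only has to leave the member NULL; callers always see either the
handler's result or EOPNOTSUPP.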


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
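
The truncation behind this fix can be illustrated with a minimal sketch.
It assumes the flag is obtained by masking ip_off with a 16-bit mask in
network byte order, which has the value 0x2000 on a big endian machine
and 0x0020 on a little endian one. The function names are hypothetical
and are not taken from the kernel source.

```c
#include <stdint.h>

/*
 * Hypothetical narrow variant: the masked result is stored in 8 bits,
 * so any flag bit above bit 7 is silently discarded.
 */
uint8_t
mff_narrow_sketch(uint16_t ip_off, uint16_t mask)
{
	return (uint8_t)(ip_off & mask);
}

/* Hypothetical wide variant: a 16-bit result keeps the flag bit. */
uint16_t
mff_wide_sketch(uint16_t ip_off, uint16_t mask)
{
	return (uint16_t)(ip_off & mask);
}
```

With mask 0x2000 the narrow variant always yields 0, while with mask
0x0020 it happens to work, which is why only big endian machines were
affected in this sketch.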


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPUs will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
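
The shape of this change can be sketched as below. It is a simplified
illustration, not the counters(9) implementation: the fixed CPU count,
the explicit cpu argument and every name ending in _sketch are
assumptions made for the example, and only three counters are listed.

```c
#include <stdint.h>

#define NCPUS_SKETCH 4	/* assumed CPU count, for illustration only */

/* Each enum value stands for one former member of the stats struct. */
enum ipstat_counters_sketch {
	ips_total,
	ips_badsum,
	ips_forward,
	ips_ncounters	/* number of counters, must stay last */
};

/* One row of counters per CPU, so updates never share a slot. */
static uint64_t ipcounters_sketch[NCPUS_SKETCH][ips_ncounters];

/* Replaces "ipstat.ips_foo++": bump the calling CPU's own slot. */
void
ipstat_inc_sketch(unsigned int cpu, enum ipstat_counters_sketch c)
{
	ipcounters_sketch[cpu % NCPUS_SKETCH][c]++;
}

/* Reading sums over all CPUs to rebuild the single exported value. */
uint64_t
ipstat_read_sketch(enum ipstat_counters_sketch c)
{
	uint64_t sum = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < NCPUS_SKETCH; cpu++)
		sum += ipcounters_sketch[cpu][c];
	return sum;
}
```

Summing at read time is what lets the values still be returned to
userland in the old struct layout while updates stay per CPU.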


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@
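
Why an index is safer than a stored pointer can be shown with a small
sketch. Everything here is hypothetical (the table, its size, and the
names ending in _sketch); it only assumes that index 0 means "no
interface" and that a lookup happens each time the interface is needed.

```c
#include <stddef.h>

#define MAXIF_SKETCH 8	/* assumed table size, for illustration only */

struct ifnet_sketch {
	unsigned int if_index;
};

/* Index-to-interface table; a NULL slot means the interface is gone. */
static struct ifnet_sketch *iftable_sketch[MAXIF_SKETCH];

void
if_attach_sketch(struct ifnet_sketch *ifp)
{
	if (ifp->if_index != 0 && ifp->if_index < MAXIF_SKETCH)
		iftable_sketch[ifp->if_index] = ifp;
}

void
if_detach_sketch(struct ifnet_sketch *ifp)
{
	if (ifp->if_index < MAXIF_SKETCH &&
	    iftable_sketch[ifp->if_index] == ifp)
		iftable_sketch[ifp->if_index] = NULL;
}

/* A stale index yields NULL instead of a dangling pointer. */
struct ifnet_sketch *
if_lookup_sketch(unsigned int idx)
{
	if (idx == 0 || idx >= MAXIF_SKETCH)
		return NULL;
	return iftable_sketch[idx];
}

/* Example interface with an arbitrary index. */
struct ifnet_sketch em0_sketch = { 3 };
```

A pcb that stores only the index can detect a removed interface at
lookup time, whereas a stored pointer would be dereferenced after the
interface was destroyed.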


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out whether the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; still missing are pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts time out and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variable definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with
structures which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.110 26-Nov-2023 bluhm

Remove inp parameter from ip_output().

ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.

OK mvs@


Revision tags: OPENBSD_7_4_BASE
# 1.109 05-Apr-2023 bluhm

ARP has a sysctl to show the number of packets waiting for an arp
response. Implement an analogous sysctl net.inet6.icmp6.nd6_queued for
ND6 to reduce places where mbufs can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of the total
hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@


Revision tags: OPENBSD_7_3_BASE
# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort(), which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement the (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others; it is never called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support the PRU_ABORT request, but actually it
is required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with a separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had a `control' mbuf(9)
leak. It was fixed in the new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support the request, leave the handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in the previous commit by removing the != 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPUs will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure the packet destination address matches the address of the
interface the packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such a pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out whether the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support bind sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.109 05-Apr-2023 bluhm

ARP has a sysctl to show the number of packets waiting for an arp
response. Implement analogous sysctl net.inet6.icmp6.nd6_queued for
ND6 to reduce places where mbufs can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of total
hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@


Revision tags: OPENBSD_7_3_BASE
# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it isn't called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocate them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only
if IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support bind sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.108 17-Nov-2022 mvs

style(9) fix. No functional change.


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort(), which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement the (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocate them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only
if IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the systems
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.107 17-Oct-2022 mvs

Change pru_abort() return type to void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort(), which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement the (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove the existing
code for all others; it is never called.

ok guenther@


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to the `so_q' or `so_q0' queues
of a listening socket. Such sockets have no corresponding file descriptor
and are not accessed from userland, so PRU_ABORT is used to destroy them on
listening socket destruction.

Currently all our sockets support the PRU_ABORT request, but it is actually
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, the unneeded handlers will be removed with a separate diff, and
this time PRU_ABORT requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect)().

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support the request, leave the handler NULL.
Do the NULL check within the corresponding pru_() wrapper and return
EOPNOTSUPP in such a case. This will be done for all upcoming user request
handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use a 16 bit variable to store the more fragment flag. This avoids loss
of significant bits on big endian machines. The bug was introduced
in the previous commit by removing the != 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such a packet also has to be dispatched with the IP_RAWOUTPUT flag.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure the packet destination address matches the address of the
interface the packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such a pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in
distinct routing tables. The same network can now be used in any domain at
the same time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered, still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.106 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect)().

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
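
The NULL-handler convention described in this entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the kernel code: `struct usrreqs_sketch`, `bind_wrapper()` and `bind_ok()` are hypothetical names, and the real handler takes more arguments than shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct socket;			/* opaque in this sketch */

/* Hypothetical, cut-down handler table; only one handler is shown. */
struct usrreqs_sketch {
	int (*bind)(struct socket *so);
};

/* Wrapper: a NULL handler means the protocol does not support the request. */
static int
bind_wrapper(const struct usrreqs_sketch *ur, struct socket *so)
{
	if (ur->bind == NULL)
		return EOPNOTSUPP;
	return (*ur->bind)(so);
}

/* Hypothetical handler for a protocol that does support the request. */
static int
bind_ok(struct socket *so)
{
	(void)so;
	return 0;
}
```

Keeping the check in one wrapper means a protocol table only lists the handlers it implements, and callers never dereference a NULL function pointer.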


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
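
The truncation this entry describes can be reproduced outside the kernel. The sketch below is an illustration under stated assumptions, not the code from the commit: the helper names are hypothetical, and the two mask constants stand for the more-fragments bit (0x2000) of a network-byte-order ip_off field as it appears when read as a host integer on big-endian and little-endian machines.

```c
#include <assert.h>
#include <stdint.h>

/* More-fragments bit of a network-order ip_off, seen as a host integer. */
#define MF_MASK_BIG_ENDIAN	0x2000u
#define MF_MASK_LITTLE_ENDIAN	0x0020u

/* Hypothetical helper: keep the masked flag in an 8-bit variable. */
static uint8_t
mff_narrow(uint16_t ip_off, uint16_t mask)
{
	return (uint8_t)(ip_off & mask);	/* drops bits 8..15 */
}

/* Hypothetical helper: keep the masked flag in a 16-bit variable. */
static uint16_t
mff_wide(uint16_t ip_off, uint16_t mask)
{
	return (uint16_t)(ip_off & mask);
}
```

With the big-endian mask the narrow variant returns 0 even though the flag is set, while the wide variant keeps it. With the little-endian mask the bit happens to fit in 8 bits, which is why only big-endian machines were affected. Comparing the masked value against zero before storing it also avoids the problem, because the result is then 0 or 1.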


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
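
The enum-indexed per-CPU layout described in this entry can be sketched in plain C. This is a simplified model under stated assumptions, not the kernel's counters implementation: the CPU count, the enum members, and the `stat_inc()`/`stat_read()` helpers are hypothetical, and the caller passes the CPU index explicitly instead of the kernel looking it up.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU_SKETCH	4	/* assumed CPU count for the sketch */

/* One enum value per former struct member; the last one is the count. */
enum ipstat_sketch { ips_total_s, ips_badsum_s, ips_ncounters_s };

/* One row of counters per CPU, so updates never share a slot. */
static uint64_t counters[NCPU_SKETCH][ips_ncounters_s];

/* Increment one counter on the given CPU's row. */
static void
stat_inc(int cpu, enum ipstat_sketch c)
{
	counters[cpu][c]++;
}

/* Reading sums the per-CPU values into a single total. */
static uint64_t
stat_read(enum ipstat_sketch c)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < NCPU_SKETCH; cpu++)
		sum += counters[cpu][c];
	return sum;
}
```

Updates touch only the local CPU's row, and the cost of combining rows is paid on the much rarer read path, which fits the entry's note that the totals are still returned to userland through the ipstat struct.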


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henninh@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


# 1.105 13-Sep-2022 mvs

Do soreceive() with shared netlock for raw sockets.

ok bluhm@


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it
required only for tcp(4) and unix(4) sockets, so i should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
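
The kind of truncation this entry describes can be illustrated with a small sketch. It assumes the masked flag value has its set bit in the upper byte (0x2000 here), which is the case where an 8 bit variable loses it. The names `IP_MF_SKETCH`, `mff_narrow()`, `mff_wide()` and `mff_bool()` are hypothetical and do not appear in the kernel.

```c
#include <stdint.h>

#define IP_MF_SKETCH 0x2000	/* more-fragments bit, assumed in the high byte */

/* Lossy: the masked value 0x2000 does not fit in 8 bits, so this yields 0. */
static uint8_t
mff_narrow(uint16_t ip_off)
{
	return (uint8_t)(ip_off & IP_MF_SKETCH);
}

/* Option 1: keep the flag in a 16 bit variable. */
static uint16_t
mff_wide(uint16_t ip_off)
{
	return (ip_off & IP_MF_SKETCH);
}

/* Option 2: reduce to 0 or 1 with != 0 before narrowing. */
static uint8_t
mff_bool(uint16_t ip_off)
{
	return ((ip_off & IP_MF_SKETCH) != 0);
}
```

Either the wider variable or the explicit `!= 0` comparison preserves the flag; dropping both is what loses it.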


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care of
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
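
The per-CPU counter scheme described in this entry can be pictured with a simplified userland model. This is only a sketch under stated assumptions: the fixed CPU count, the array layout, and the names ending in `_sketch` are hypothetical, and the real kernel uses its own per-CPU memory and `counters_inc`.

```c
#include <stdint.h>

#define NCPU_SKETCH 4	/* assumed CPU count for the illustration */

/* Each enum value stands for one former member of the stats struct. */
enum ipstat_counters_sketch {
	ips_total_sketch,
	ips_badsum_sketch,
	ips_ncounters_sketch
};

/* One row of counters per CPU, so updates do not share a cache line owner. */
static uint64_t ipcounters_sketch[NCPU_SKETCH][ips_ncounters_sketch];

/* Thin wrapper: bump one counter on the given CPU. */
static void
ipstat_inc_sketch(unsigned int cpu, enum ipstat_counters_sketch c)
{
	ipcounters_sketch[cpu][c]++;
}

/* Reading sums the per-CPU values into the single number userland sees. */
static uint64_t
ipstat_read_sketch(enum ipstat_counters_sketch c)
{
	uint64_t sum = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < NCPU_SKETCH; cpu++)
		sum += ipcounters_sketch[cpu][c];
	return (sum);
}
```

The update path touches only the local CPU's row, while the comparatively rare read path pays the cost of summing all rows.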


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids dont repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.104 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it
required only for tcp(4) and unix(4) sockets, so i should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the =! 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move existing the code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocate them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the gloabl vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP optaions are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which corresponds to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
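
The counter scheme described in this entry can be sketched in userland C.
This is only an illustration under stated assumptions, not the kernel code:
every name below (sketch_ipstat_inc, sketch_ipstat_read, SKETCH_NCPU and
the enum members) is hypothetical, and the real counters_inc machinery is
not reproduced.

```c
#include <stdint.h>

/* Hypothetical sketch: enum values stand in for the old struct members. */
enum sketch_ipstat_counters {
	sk_ips_total,		/* total packets received */
	sk_ips_badsum,		/* checksum bad */
	sk_ips_ncounters	/* number of counters, keep last */
};

#define SKETCH_NCPU	4	/* assumed CPU count for the illustration */

/* One row of counters per CPU so updates never share a slot. */
static uint64_t sketch_counters[SKETCH_NCPU][sk_ips_ncounters];

/* Replaces the old "ipstat.ips_foo++" style update. */
static inline void
sketch_ipstat_inc(unsigned int cpu, enum sketch_ipstat_counters c)
{
	sketch_counters[cpu][c]++;
}

/* Sum the per-CPU rows when a total is exported to userland. */
static inline uint64_t
sketch_ipstat_read(enum sketch_ipstat_counters c)
{
	uint64_t sum = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < SKETCH_NCPU; cpu++)
		sum += sketch_counters[cpu][c];
	return sum;
}
```

Keeping a row per CPU means an increment touches only local data, while a
reader that still wants one struct-like total sums the rows on demand.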


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
address must exist on the interface the packet was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids dont repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.103 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
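
The NULL-handler convention described in this entry can be sketched as
follows. This is a reduced userland illustration, not the kernel code: the
struct layout, the sketch_ names and the handler signature are assumptions
made for the example.

```c
#include <errno.h>
#include <stddef.h>

struct sketch_socket;	/* opaque stand-in for struct socket */

/* Hypothetical, reduced handler table; the real one has more members. */
struct sketch_usrreqs {
	int (*pru_bind)(struct sketch_socket *, const void *addr);
};

/* Wrapper: protocols without a bind handler leave the pointer NULL. */
static inline int
sketch_pru_bind(const struct sketch_usrreqs *reqs, struct sketch_socket *so,
    const void *addr)
{
	if (reqs->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*reqs->pru_bind)(so, addr);
}

/* Example handler that always succeeds. */
static int
sketch_bind_ok(struct sketch_socket *so, const void *addr)
{
	(void)so;
	(void)addr;
	return 0;
}
```

Doing the NULL check in one wrapper keeps callers free of per-protocol
special cases: a protocol opts out simply by not filling in the pointer.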


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
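
The truncation described in this entry can be illustrated with a small
sketch. It assumes that the flag was masked out of ip_off in network byte
order and that the earlier variable was 8 bits wide; the log states
neither, so treat both as assumptions. All sketch_ names are hypothetical.

```c
#include <stdint.h>

/* Network-order bit pattern of the more-fragments flag per host type. */
#define SKETCH_MF_BIG_ENDIAN	0x2000	/* htons() is a no-op */
#define SKETCH_MF_LITTLE_ENDIAN	0x0020	/* htons() swaps the bytes */

/* Storing the masked value in 8 bits keeps only the low byte. */
static inline uint8_t
sketch_mff_8bit(uint16_t ip_off, uint16_t mf_mask)
{
	return (uint8_t)(ip_off & mf_mask);
}

/* A 16 bit variable holds the masked value unchanged. */
static inline uint16_t
sketch_mff_16bit(uint16_t ip_off, uint16_t mf_mask)
{
	return (uint16_t)(ip_off & mf_mask);
}
```

With the big endian pattern the flag sits in the high byte, so the 8 bit
store yields 0 even though the flag is set. An explicit != 0 comparison
would have reduced the value to 0 or 1 before the store, which is why
removing that check exposed the problem.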


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
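
The per-CPU counter scheme described in 1.63 can be sketched in userland C.
This is only an illustration under stated assumptions: NCPUS, ipcounters,
ipstat_inc_sketch() and ipstat_read_sketch() are hypothetical names, the CPU
number is passed explicitly, and only three of the ipstat members are shown.
The kernel's own counters_inc() is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

#define NCPUS 4	/* hypothetical fixed CPU count for this sketch */

/* One enum value per former struct ipstat member (subset shown). */
enum ipstat_counters {
	ips_total,
	ips_badsum,
	ips_forward,
	ips_ncounters
};

/* Each CPU updates only its own row, so increments need no shared lock. */
static uint64_t ipcounters[NCPUS][ips_ncounters];

/* Stand-in for ipstat_inc(); the cpu argument is explicit in this sketch. */
static void
ipstat_inc_sketch(unsigned int cpu, enum ipstat_counters c)
{
	assert(cpu < NCPUS && c < ips_ncounters);
	ipcounters[cpu][c]++;
}

/* Sum the per-CPU rows into one value, as when filling struct ipstat. */
static uint64_t
ipstat_read_sketch(enum ipstat_counters c)
{
	uint64_t sum = 0;
	unsigned int cpu;

	assert(c < ips_ncounters);
	for (cpu = 0; cpu < NCPUS; cpu++)
		sum += ipcounters[cpu][c];
	return sum;
}
```

The design point is that the update path touches only CPU-local memory, while
the rarely used read path pays the cost of summing all rows.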


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@
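
Why an index is safer than a stored pointer can be shown with a small
userland sketch. All names here (iftable, ifnet_sketch, if_lookup_sketch()
and so on) are hypothetical stand-ins, not the kernel's interfaces; the
sketch also assumes that index 0 is never assigned to a real interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAXIF 8	/* hypothetical size of the interface table */

struct ifnet_sketch {
	unsigned int	 if_index;
	const char	*if_xname;
};

/* Index-to-interface table; a slot is NULL when no interface owns it. */
static struct ifnet_sketch *iftable[MAXIF];

static void
if_attach_sketch(struct ifnet_sketch *ifp)
{
	assert(ifp->if_index > 0 && ifp->if_index < MAXIF);
	iftable[ifp->if_index] = ifp;
}

static void
if_detach_sketch(struct ifnet_sketch *ifp)
{
	assert(ifp->if_index > 0 && ifp->if_index < MAXIF);
	iftable[ifp->if_index] = NULL;
}

/* A lookup after detach yields NULL instead of a dangling pointer. */
static struct ifnet_sketch *
if_lookup_sketch(unsigned int idx)
{
	return (idx < MAXIF) ? iftable[idx] : NULL;
}

/* Multicast options remember only the index, never the pointer. */
struct ip_moptions_sketch {
	unsigned int	imo_ifidx;
};
```

A caller resolves imo_ifidx each time it needs the interface and handles a
NULL result, so a destroyed interface cannot be dereferenced through the pcb.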


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the systems
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame
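
The check described in 1.15 can be sketched as follows. This is an
illustration only: the function names and IFF_LOOPBACK_SKETCH are
hypothetical, addresses are taken in host byte order, and "from outside" is
modelled as "the receiving interface is not a loopback interface".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag: the packet arrived on a loopback interface. */
#define IFF_LOOPBACK_SKETCH 0x8

/* True if a host-byte-order IPv4 address lies in 127.0.0.0/8. */
static int
in_loopbacknet_sketch(uint32_t addr)
{
	return (addr >> 24) == 127;
}

/*
 * Return 1 if the packet should be dropped (and counted as a bad
 * address), 0 if it may continue through input processing.
 */
static int
ip_badaddr_sketch(uint32_t src, uint32_t dst, unsigned int if_flags)
{
	/* This sketch knows only the one flag defined above. */
	assert((if_flags & ~(unsigned int)IFF_LOOPBACK_SKETCH) == 0);

	if (if_flags & IFF_LOOPBACK_SKETCH)
		return 0;	/* local traffic may use 127/8 */
	return in_loopbacknet_sketch(src) || in_loopbacknet_sketch(dst);
}
```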


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids dont repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.102 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
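
The NULL-handler convention described in 1.98 can be sketched like this. The
struct layouts, the int port argument and every name ending in _sketch are
hypothetical simplifications; only the pattern (a wrapper that returns
EOPNOTSUPP when the handler pointer is NULL) is taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct socket_sketch {
	int	so_bound;	/* port recorded by a successful bind */
};

/* Per-protocol table of user request handlers (one member shown). */
struct pr_usrreqs_sketch {
	int	(*pru_bind)(struct socket_sketch *, int);
};

/* Wrapper: protocols without a bind handler get EOPNOTSUPP. */
static int
pru_bind_sketch(const struct pr_usrreqs_sketch *reqs,
    struct socket_sketch *so, int port)
{
	assert(reqs != NULL && so != NULL);
	if (reqs->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*reqs->pru_bind)(so, port);
}

/* Example handler for a protocol that supports bind. */
static int
udp_bind_sketch(struct socket_sketch *so, int port)
{
	so->so_bound = port;
	return 0;
}

static const struct pr_usrreqs_sketch udp_reqs_sketch = {
	.pru_bind = udp_bind_sketch
};

/* A protocol that does not support bind leaves the handler NULL. */
static const struct pr_usrreqs_sketch nobind_reqs_sketch = {
	.pru_bind = NULL
};
```

Keeping the NULL check in one wrapper means callers never test the pointer
themselves, and a protocol opts out simply by omitting the handler.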


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
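
The truncation problem behind 1.95 can be reproduced on any host with the
sketch below. It is an assumption-laden illustration, not the kernel code:
the helper names are hypothetical, and the two byte orders are simulated by
passing the mask value each kind of machine would see for the
network-order more-fragments bit (0x2000 on big endian, 0x0020 on little
endian).

```c
#include <assert.h>
#include <stdint.h>

#define IP_MF_SKETCH 0x2000	/* more-fragments bit as a big endian CPU reads it */

/* Byte swap, used to obtain the little endian view of the same bit. */
static uint16_t
swap16_sketch(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Buggy variant: the masked 16-bit value is narrowed to 8 bits. */
static int
mff_narrow_sketch(uint16_t ip_off_raw, uint16_t mf_mask)
{
	uint8_t mff;

	assert((mf_mask & (mf_mask - 1)) == 0);	/* single-bit mask */
	mff = (uint8_t)(ip_off_raw & mf_mask);
	return mff != 0;
}

/* Fixed variant: a 16-bit variable keeps the high byte. */
static int
mff_wide_sketch(uint16_t ip_off_raw, uint16_t mf_mask)
{
	uint16_t mff;

	assert((mf_mask & (mf_mask - 1)) == 0);	/* single-bit mask */
	mff = ip_off_raw & mf_mask;
	return mff != 0;
}
```

With the big endian view the flag sits in the high byte, so the 8-bit
variable silently drops it; with the little endian view it sits in the low
byte, which is why the bug went unnoticed there.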


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocate them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids dont repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.101 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support the request, leave the handler NULL. Do
the NULL check within the corresponding pru_() wrapper and return EOPNOTSUPP in
such a case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use a 16 bit variable to store the more fragments flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPUs will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to an
alternate routing table and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids dont repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with
structures which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.100 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.99 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support the request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPUs will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which corresponds to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if if ip_directbcast is unset
then a packet coming in on any interface to any of the systems
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support bind sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently then in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any doamin at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks accross
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privilages to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.98 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
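
The NULL check described above can be pictured with a small userland model. This is only a sketch under assumptions: the `_model` types and `example_bind()` are hypothetical stand-ins, not the kernel's real structures or prototypes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct socket_model;

struct pr_usrreqs_model {
	/* Left NULL when the protocol does not support bind. */
	int (*pru_bind)(struct socket_model *so, const void *nam);
};

struct socket_model {
	const struct pr_usrreqs_model *so_usrreqs;
	int so_bound;
};

/* Wrapper: the NULL check lives here instead of in every caller. */
static inline int
pru_bind_model(struct socket_model *so, const void *nam)
{
	assert(so != NULL && so->so_usrreqs != NULL);
	if (so->so_usrreqs->pru_bind == NULL)
		return (EOPNOTSUPP);
	return ((*so->so_usrreqs->pru_bind)(so, nam));
}

/* Example handler for a protocol that does support bind. */
static int
example_bind(struct socket_model *so, const void *nam)
{
	(void)nam;
	so->so_bound = 1;
	return (0);
}
```

A protocol that sets the pointer gets its handler called; one that leaves it NULL gets EOPNOTSUPP from the wrapper without any per-protocol code.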


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use a 16 bit variable to store the more fragments flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
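
One way such a loss can arise is sketched below. This is an illustration under assumptions, not the committed code: the function names are hypothetical, and `wire_mf` stands for the more fragments mask as it looks in a 16-bit field read without byte swapping (0x2000 on big endian, 0x0020 on little endian).

```c
#include <stdint.h>

/* Store the masked flag in 8 bits: the upper byte is discarded. */
static int
mff_in_8bit(uint16_t ip_off, uint16_t wire_mf)
{
	uint8_t mff = (uint8_t)(ip_off & wire_mf);

	return (mff != 0);
}

/* Store the masked flag in 16 bits: every bit of the mask survives. */
static int
mff_in_16bit(uint16_t ip_off, uint16_t wire_mf)
{
	uint16_t mff = (uint16_t)(ip_off & wire_mf);

	return (mff != 0);
}
```

With the big endian mask the set bit sits in the upper byte, so the 8-bit variant reports the flag as clear, while the little endian mask happens to fit in the lower byte and hides the problem.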


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPUs will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
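
The enum-indexed layout can be modelled in userland roughly as follows. This is a sketch under assumptions: the `_model` names, the fixed CPU count and the explicit `cpu` argument are hypothetical, and the kernel's real per-CPU counter API is not reproduced here.

```c
#include <stdint.h>

#define NCPU_MODEL 4	/* hypothetical fixed CPU count */

/* One enum value per former member of the stats struct. */
enum ipstat_counters_model {
	ips_total_model,
	ips_badsum_model,
	ips_ncounters_model
};

/* One row of counters per CPU, so updates do not share a slot. */
static uint64_t ipcounters_model[NCPU_MODEL][ips_ncounters_model];

/* Thin wrapper replacing a direct "stat.member++" update. */
static inline void
ipstat_inc_model(unsigned int cpu, enum ipstat_counters_model c)
{
	ipcounters_model[cpu % NCPU_MODEL][c]++;
}

/* Readers sum the per-CPU slots to rebuild the exported value. */
static uint64_t
ipstat_read_model(enum ipstat_counters_model c)
{
	uint64_t sum = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < NCPU_MODEL; cpu++)
		sum += ipcounters_model[cpu][c];
	return (sum);
}
```

Summing on read is what lets the totals still be handed to userland in the old struct layout.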


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame
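
A check of this kind might look like the sketch below. It is written under assumptions: addresses are taken in host byte order, the function and macro names are hypothetical, and the `from_loopback_if` flag stands in for whatever test the kernel applies to the receiving interface.

```c
#include <stdint.h>

#define LOOPBACKNET_MODEL 127	/* top octet of the loopback network */

/* addr is an IPv4 address in host byte order. */
static int
is_loopback_net(uint32_t addr)
{
	return ((addr >> 24) == LOOPBACKNET_MODEL);
}

/*
 * Return 1 if the packet should be dropped: a loopback-network
 * address in either header field, on a packet that did not arrive
 * on a loopback interface.
 */
static int
drop_loopback_from_outside(uint32_t src, uint32_t dst, int from_loopback_if)
{
	if (from_loopback_if)
		return (0);
	return (is_loopback_net(src) || is_loopback_net(dst));
}
```

A caller would bump its bad-address counter whenever this returns 1, before freeing the packet.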


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.97 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.96 12-Aug-2022 bluhm

There are some places in ip and ip6 input where operations fail due
to out of memory. Use a generic idropped counter for those.
OK mvs@


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the =! 0 check.
OK mvs@


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move existing the code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocate them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the gloabl vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPUs will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards running more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL.
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
an alternate routing table and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; still missing are pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.
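
A minimal userland sketch of the scheme described above, not the kernel
code: the `_sketch` names, the fixed CPU count and the explicit `cpu`
argument are assumptions made for illustration. Each CPU bumps only its
own row, and the rows are summed when the totals are copied into the
struct handed to userland.

```c
#include <stdint.h>

#define NCPU_SKETCH 4	/* hypothetical CPU count for this sketch */

/* One enum value per counter, mirroring former struct members. */
enum ipstat_counters_sketch {
	ips_total,
	ips_badsum,
	ips_forward,
	ips_ncounters	/* number of counters, must stay last */
};

/* One row of counters per CPU; a CPU only writes its own row. */
static uint64_t ipstat_percpu_sketch[NCPU_SKETCH][ips_ncounters];

/* Stand-in for the thin wrapper: bump one counter on one CPU. */
static void
ipstat_inc_sketch(unsigned int cpu, enum ipstat_counters_sketch c)
{
	ipstat_percpu_sketch[cpu % NCPU_SKETCH][c]++;
}

/* Shape of the struct still exported to userland. */
struct ipstat_sketch {
	uint64_t ips_total;
	uint64_t ips_badsum;
	uint64_t ips_forward;
};

/* Sum all per-CPU rows into the exported struct. */
static void
ipstat_read_sketch(struct ipstat_sketch *out)
{
	uint64_t sum[ips_ncounters] = { 0 };
	unsigned int cpu, c;

	for (cpu = 0; cpu < NCPU_SKETCH; cpu++)
		for (c = 0; c < ips_ncounters; c++)
			sum[c] += ipstat_percpu_sketch[cpu][c];

	out->ips_total = sum[ips_total];
	out->ips_badsum = sum[ips_badsum];
	out->ips_forward = sum[ips_forward];
}
```

The update path touches no shared cache line, which is the point of the
conversion; only the rarely used read path pays for the summation.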

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.
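
A minimal sketch of why an index is safer than a stored pointer. All
`_sketch` names and the fixed-size table are hypothetical stand-ins, not
the kernel's interface lookup: the options struct keeps only a number,
and the lookup returns NULL once the interface has been detached.

```c
#include <stddef.h>

#define MAXIF_SKETCH 8	/* hypothetical table size */

struct ifnet_sketch {
	unsigned int	 if_index;	/* 1..MAXIF_SKETCH-1, 0 is "none" */
	const char	*if_xname;
};

/* Index -> interface table; slot is NULL when no interface is attached. */
static struct ifnet_sketch *iftable_sketch[MAXIF_SKETCH];

static void
if_attach_sketch(struct ifnet_sketch *ifp)
{
	if (ifp->if_index > 0 && ifp->if_index < MAXIF_SKETCH)
		iftable_sketch[ifp->if_index] = ifp;
}

static void
if_detach_sketch(struct ifnet_sketch *ifp)
{
	if (ifp->if_index > 0 && ifp->if_index < MAXIF_SKETCH)
		iftable_sketch[ifp->if_index] = NULL;
}

/* Resolve an index at use time; NULL means the interface is gone. */
static struct ifnet_sketch *
if_get_sketch(unsigned int idx)
{
	if (idx == 0 || idx >= MAXIF_SKETCH)
		return NULL;
	return iftable_sketch[idx];
}

/* Multicast options store the index, never the pointer. */
struct moptions_sketch {
	unsigned int	imo_ifidx;
};
```

A caller resolves `imo_ifidx` each time it needs the interface and
handles the NULL case, so a destroyed interface produces an error
instead of a dangling dereference.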

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
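
For illustration only, a sketch contrasting the two calling styles. The
function names and the reduced parameter lists are hypothetical, not the
real ip_output() prototype. With varargs the compiler cannot check the
type or even the presence of the trailing argument; with a fixed
parameter it can.

```c
#include <stdarg.h>
#include <stdint.h>

/* Old style: the flow information hides behind "...". */
static uint32_t
output_varargs_sketch(int flags, ...)
{
	va_list ap;
	uint32_t flowinfo;

	(void)flags;
	va_start(ap, flags);
	/* Caller must have passed exactly one uint32_t; nothing enforces it. */
	flowinfo = va_arg(ap, uint32_t);
	va_end(ap);
	return flowinfo;
}

/* New style: the argument is part of the prototype and type checked. */
static uint32_t
output_fixed_sketch(int flags, uint32_t flowinfo)
{
	(void)flags;
	return flowinfo;
}
```

Callers without IPsec flow information simply pass 0 to the fixed
variant.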
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL.
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out whether the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue to support some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with
a structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
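
A rough sketch of the "queue element" idea, with hypothetical `_sketch`
types rather than the real netinet structures. The header stays a fixed
20-byte layout regardless of pointer width, and the list links live in a
separate element that points at the header.

```c
#include <stddef.h>

/* On-the-wire header: fixed size, contains no pointers. */
struct iphdr_sketch {
	unsigned char	bytes[20];
};

/* Separate queue element doing the threading. */
struct ipqent_sketch {
	struct ipqent_sketch	*next;	/* link to the next queued element */
	struct iphdr_sketch	*ip;	/* header this element refers to */
};

/* Push an element for the given header at the front of the list. */
static void
ipq_insert_sketch(struct ipqent_sketch **head, struct ipqent_sketch *e,
    struct iphdr_sketch *ip)
{
	e->ip = ip;
	e->next = *head;
	*head = e;
}
```

Because the pointers are no longer overlaid on packet data, their size
can differ between 32-bit and 64-bit machines without disturbing the
header layout.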


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.95 04-Aug-2022 bluhm

Use 16 bit variable to store more fragment flag. This avoids loss
of significant bits on big endian machines. Bug has been introduced
in previous commit by removing the != 0 check.
OK mvs@
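
A small sketch of the truncation described in this entry. The helper
names are hypothetical; the masks are the "more fragments" bit 0x2000 as
it appears in memory on a big endian host and its byte-swapped form
0x0020 on a little endian host. Narrowing the masked value to 8 bits
keeps the flag only in the little endian case.

```c
#include <stdint.h>

/* Buggy pattern: masked 16-bit value narrowed to 8 bits. */
static uint8_t
mff_narrow_sketch(uint16_t off, uint16_t mask)
{
	return (uint8_t)(off & mask);
}

/* Fix used here: keep the masked value in a 16-bit variable. */
static uint16_t
mff_wide_sketch(uint16_t off, uint16_t mask)
{
	return off & mask;
}

/* Alternative: reduce to 0 or 1 with an explicit != 0 check. */
static uint8_t
mff_bool_sketch(uint16_t off, uint16_t mask)
{
	return (off & mask) != 0;
}
```

With mask 0x2000 the narrow variant yields 0 although the flag is set,
while the wide and boolean variants both preserve it.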


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids dont repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
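
The layout change described above can be sketched like this. It is a
simplified illustration under stated assumptions: the struct and field
names ending in _sketch are invented for the example, the header is
reduced to its first few fixed-width fields, and a plain singly linked
list stands in for whatever queue type the kernel actually used. What
it shows is that the pointers live in a separate queue element, so the
header overlay keeps the same size whether pointers are 32 or 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header overlay: fixed-width fields only, so its size is ABI independent. */
struct ip_hdr_sketch {
	uint8_t  ip_vhl;	/* version and header length */
	uint8_t  ip_tos;	/* type of service */
	uint16_t ip_len;	/* total length */
	uint16_t ip_id;		/* identification */
	uint16_t ip_off;	/* fragment offset and flags */
};

/* Separate queue element: the threading pointers live here. */
struct ipqent_sketch {
	struct ipqent_sketch *ipqe_next;
	struct ip_hdr_sketch *ipqe_ip;	/* points at the header, not into it */
};

/* Link a header onto the front of the queue through its queue element. */
static void
ipq_push_sketch(struct ipqent_sketch **head, struct ipqent_sketch *ent,
    struct ip_hdr_sketch *ip)
{
	assert(head != NULL && ent != NULL && ip != NULL);
	ent->ipqe_ip = ip;
	ent->ipqe_next = *head;
	*head = ent;
}

/* Count the queued elements by walking the threading pointers. */
static size_t
ipq_count_sketch(const struct ipqent_sketch *head)
{
	size_t n = 0;

	for (; head != NULL; head = head->ipqe_next)
		n++;
	return n;
}
```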


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.94 25-Jul-2022 bluhm

The IPv4 reassembly code is MP safe, so we can run it in parallel.
Note that ip_ours() runs with shared netlock, while ip_local() has
exclusive netlock after queuing. Move the existing code into
function ip_fragcheck() and call it from ip_ours().
OK mvs@


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocate them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
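
A rough sketch of the per-CPU counter pattern this commit describes
follows. It is an illustration only: the fixed CPU count, the explicit
cpu argument, the three sample counters and all names ending in _sketch
are assumptions for this example, and the real ipstat_inc() wraps
counters_inc() rather than a plain array. Each CPU bumps its own slot,
and a reader sums the slots when totals are exported.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NCPUS 4	/* hypothetical CPU count */

/* One enum value per former struct member; only a few are shown here. */
enum ipstat_counters_sketch {
	ips_total,
	ips_badsum,
	ips_cantforward,
	ips_ncounters
};

/* One row of counters per CPU, so updates never share a slot. */
static uint64_t ipstat_percpu[SKETCH_NCPUS][ips_ncounters];

/* Replaces the old "ipstat.ips_foo++" on a single shared struct. */
static inline void
ipstat_inc_sketch(unsigned int cpu, enum ipstat_counters_sketch c)
{
	assert(cpu < SKETCH_NCPUS && c < ips_ncounters);
	ipstat_percpu[cpu][c]++;
}

/* Sum the per-CPU slots, as done when filling the userland ipstat struct. */
static uint64_t
ipstat_read_sketch(enum ipstat_counters_sketch c)
{
	uint64_t sum = 0;
	unsigned int cpu;

	assert(c < ips_ncounters);
	for (cpu = 0; cpu < SKETCH_NCPUS; cpu++)
		sum += ipstat_percpu[cpu][c];
	return sum;
}
```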


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such a pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.93 05-May-2022 claudio

Use static objects for struct rttimer_queue instead of dynamically
allocating them.

Currently there are 6 rttimer_queues and not many more will follow. So
change rt_timer_queue_create() to rt_timer_queue_init() which now takes
a struct rttimer_queue * as argument which will be initialized.
Since this changes the global vars from pointer to struct, adjust other
callers as well.
OK bluhm@


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPU will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.92 28-Apr-2022 bluhm

Decouple IP input and forwarding from protocol input. This allows
to have parallel IP processing while the upper layers are still not
MP safe. Introduce ip_ours() that enqueues the packets and ipintr()
that dequeues and processes them with an exclusive netlock.
Note that we still have only one softnet task. Running IP processing
on multiple CPUs will be the next step.
lots of testing Hrvoje Popovski; OK sashan@


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which corresponds to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL.
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts time out and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.91 20-Apr-2022 bluhm

Route timeout was a mixture of int, u_int and long. Use type int
for timeout, add sysctl bounds checking between 0 and max int, and
use time_t for absolute times.

Some code assumes that the route timeout queue can be NULL and at
some places this was checked. Better make sure that all queues
always exist. The pool_get for struct rttimer_queue is only called
from initialization and from syscall, so PR_WAITOK is possible.

Keep the special hack when ip_mtudisc is set to 0. Destroy the
queue and generate an empty one.

If redirect timeout is 0, it should not time out. Check the value
in IPv6 to make the behavior like IPv4.

Sysctl net.inet6.icmp6.redirtimeout had no effect as the queue
timeout was not modified. Make icmp6_sysctl() look like icmp_sysctl().

OK claudio@


Revision tags: OPENBSD_7_1_BASE
# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which corresponds to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henninh@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL.
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids dont repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.90 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only
if IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which corresponds to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henninh@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL.
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support bind sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo
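
The effect described in 1.6 (an interface answering for every address in its
configured network) amounts to a masked comparison. The following is only a
minimal sketch under stated assumptions: the helper names sketch_ipv4() and
sketch_wildcard_match() are hypothetical, addresses are kept in host byte
order for readability, and none of this is taken from the committed code.

```c
#include <stdint.h>

/* Build a host-order IPv4 address from its dotted-quad parts. */
static uint32_t
sketch_ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
	    ((uint32_t)c << 8) | (uint32_t)d;
}

/*
 * Hypothetical wildcard check: nonzero if dst falls into the same
 * network as the interface address ifaddr under the given netmask.
 */
static int
sketch_wildcard_match(uint32_t dst, uint32_t ifaddr, uint32_t mask)
{
	return (dst & mask) == (ifaddr & mask);
}
```

With the lo1 configuration quoted above, any destination in 192.168.1.0/24
would match, while an address in 192.168.2.0/24 would not.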


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
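
The "queue element" idea from 1.2 can be illustrated roughly as follows. This
is a sketch, not the historical code: struct sketch_iphdr, struct
sketch_ipqent and sketch_ipq_insert() are made-up names, and the header is
reduced to two fields. The point is only that the list pointers live in a
separate element that points at the header, so the header layout no longer
depends on pointer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an IP header; no pointers are stored in it. */
struct sketch_iphdr {
	uint16_t len;	/* payload length in bytes */
	uint16_t off;	/* fragment offset in bytes */
};

/* Separate queue element: threading happens here, not in the header. */
struct sketch_ipqent {
	struct sketch_ipqent *next;
	struct sketch_iphdr *ip;
};

/* Insert e into the list at *head, keeping ascending fragment offsets. */
static void
sketch_ipq_insert(struct sketch_ipqent **head, struct sketch_ipqent *e)
{
	struct sketch_ipqent **pp = head;

	assert(e != NULL && e->ip != NULL);
	while (*pp != NULL && (*pp)->ip->off <= e->ip->off)
		pp = &(*pp)->next;
	e->next = *pp;
	*pp = e;
}
```

Because struct sketch_iphdr contains only fixed-width integers, its size is
the same on 32-bit and 64-bit machines; only struct sketch_ipqent grows with
the pointer width.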


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.89 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@
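
The counter scheme described in 1.63 can be sketched in userland C roughly as
follows. This is an illustration under assumptions, not the kernel's
counters(9) code: the fixed SKETCH_NCPU, the array layout, the explicit cpu
argument, and the names ipstat_inc_on() and ipstat_read() are all invented
for the sketch, and only three counter ids are shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of counter ids; the real enum mirrors struct ipstat. */
enum ipstat_counters {
	ips_total,
	ips_badsum,
	ips_cantforward,
	ips_ncounters
};

#define SKETCH_NCPU 4	/* assumption: fixed CPU count for the sketch */

/* One row of counters per CPU, so an update touches only CPU-local memory. */
static uint64_t ipstat_percpu[SKETCH_NCPU][ips_ncounters];

/* Stand-in for ipstat_inc(ips_foo); the CPU is passed explicitly here. */
static void
ipstat_inc_on(unsigned int cpu, enum ipstat_counters c)
{
	assert(cpu < SKETCH_NCPU && c < ips_ncounters);
	ipstat_percpu[cpu][c]++;
}

/* Sum the per-CPU rows when a total has to be reported. */
static uint64_t
ipstat_read(enum ipstat_counters c)
{
	uint64_t sum = 0;
	unsigned int cpu;

	assert(c < ips_ncounters);
	for (cpu = 0; cpu < SKETCH_NCPU; cpu++)
		sum += ipstat_percpu[cpu][c];
	return sum;
}
```

Summing on read is what would let the totals still be copied into the
existing struct ipstat for userland, as the commit message notes.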


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@
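
Why an index is safer than a stored pointer, as argued in 1.58, can be shown
with a small sketch. Everything here is hypothetical (the table, struct
sketch_ifnet, struct sketch_moptions and the sketch_if_*() helpers); it only
assumes that interfaces are looked up by index at the time of use and that a
detached interface leaves an empty slot.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAXIF 8	/* assumption: small fixed interface table */

struct sketch_ifnet {
	unsigned int index;
	int mtu;
};

/* Multicast options keep an index, never a pointer to the interface. */
struct sketch_moptions {
	unsigned int ifidx;
};

static struct sketch_ifnet *sketch_iftable[SKETCH_MAXIF];

static void
sketch_if_attach(struct sketch_ifnet *ifp)
{
	assert(ifp != NULL && ifp->index < SKETCH_MAXIF);
	sketch_iftable[ifp->index] = ifp;
}

static void
sketch_if_detach(unsigned int idx)
{
	if (idx < SKETCH_MAXIF)
		sketch_iftable[idx] = NULL;
}

/* Resolve the index at time of use; NULL means the interface is gone. */
static struct sketch_ifnet *
sketch_if_lookup(unsigned int idx)
{
	if (idx >= SKETCH_MAXIF)
		return NULL;
	return sketch_iftable[idx];
}
```

After a detach, a lookup through the stored index yields NULL, which the
caller can handle, instead of dereferencing a stale pointer.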


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henninh@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@



# 1.88 30-Mar-2021 sashan

[ICMP] IP options lead to malformed reply

icmp_send() must update IP header length if IP options are appended.
Such packet also has to be dispatched with IP_RAWOUTPUT flags.

Bug reported and fix co-designed by Dominik Schreilechner _at_ siemens _dot_ com

OK bluhm@


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only
if IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL.
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.87 01-Mar-2021 bluhm

Refactor ip_fragment() and ip6_fragment(). Use a mbuf list to
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitionning to an un-KERNEL_LOCK()ed IP
forwarding path.

Disucssed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting to run more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registred when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunistic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
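
A minimal sketch of the "queue element" approach described above, assuming a
fixed 20-byte header stand-in. The names ip_hdr_sketch, ipqent_sketch,
ipqent_sketch_push and ipqent_sketch_len are hypothetical and only illustrate
the idea of keeping link pointers outside the header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: stand-in for a fixed-size 20-byte IPv4 header. */
struct ip_hdr_sketch {
	uint8_t bytes[20];
};

/*
 * Separate queue element: the link pointer lives here, so its size
 * (4 or 8 bytes) never changes the layout of the header it points at.
 */
struct ipqent_sketch {
	struct ipqent_sketch *ipqe_next;
	struct ip_hdr_sketch *ipqe_ip;
};

/* Link an element for the given header at the front of the queue. */
static void
ipqent_sketch_push(struct ipqent_sketch **head, struct ipqent_sketch *qe,
    struct ip_hdr_sketch *ip)
{
	qe->ipqe_ip = ip;
	qe->ipqe_next = *head;
	*head = qe;
}

/* Count queued headers by walking the elements, not the headers. */
static size_t
ipqent_sketch_len(const struct ipqent_sketch *head)
{
	size_t n = 0;

	for (; head != NULL; head = head->ipqe_next)
		n++;
	return n;
}
```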


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.86 08-Dec-2019 sashan

Make sure packet destination address matches interface address,
where such packet is bound to. This check is enforced if and only if
IP forwarding is disabled.

Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@

OK bluhm@, claudio@, tobhe@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have fewer different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards running more than one multicast
socket in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which correspond to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registred when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested a least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are are lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henninh@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others commited minutes ago,
so it is seperate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enable, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
a m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if if ip_directbcast is unset
then a packet coming in on any interface to any of the systems
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we now no longer support bind sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently then in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any doamin at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks accross
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privilages to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE
# 1.36 29-May-2006 claudio

Make savecontrol functions more generic and use them now for raw IP too.
Additionally add the IP_RECVIF option which returns the interface a packet
was received on. OK markus@ norby@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.35 11-Aug-2005 mpf

New counter for not joined IPv4 multicast groups.
Don't count link local scope multicast as not forwardable.
This stops ips_cantforward growing on carp(4) networks.
tested and ok mcbride@, ok markus@.


# 1.34 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.33 27-May-2005 mcbride

Experimental support for opportunitic use of jumbograms where only some hosts
on the local network support them.

This adds a new socket option, SO_JUMBO, and a new route flag,
RTF_JUMBO. If _both_ the socket option is set and the route for the host
has RTF_JUMBO set, ip_output will fragment the packet to the largest
possible size for the link, ignoring the card's MTU.

The semantics of this feature will be evolving rapidly; talk to us
if you intend to use it.

ok deraadt@ marius@


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.32 22-Jun-2004 cedric

Pull the plug on source-based routing until remaining bugs are eradicated.
No need to reconfig kernel or rebuild userland stuff.
requested deraadt@, help beck@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.31 06-Jun-2004 cedric

extend routing table to be able to match and route packets based on
their *source* IP address in addition to their destination address.
routing table "destination" now contains a "struct sockaddr_rtin"
for IPv4 instead of a "struct sockaddr_in".
the routing socket has been extended in a backward-compatible way.
todo: PMTU enhancements, IPv6. ok deraadt@ mcbride@


# 1.30 28-Apr-2004 cedric

make return-rst work on pure bridges. ok dhartmei@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.29 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.28 12-Feb-2003 jason

Remove commons; inspired by netbsd.


# 1.27 09-Dec-2002 millert

From Andrushock, s/sucess/success/g


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.26 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.25 09-Jun-2002 itojun

whitespace


# 1.24 31-May-2002 itojun

respect rmx_mtu (cached PMTUD result) on outbound. deraadt/angelos ok


# 1.23 28-May-2002 jasoni

Factor out IP fragmentation code into its own function so it can be
reused.
- ok jason@, dhartmei@


Revision tags: OPENBSD_3_1_BASE
# 1.22 14-Mar-2002 millert

First round of __P removal in sys


# 1.21 24-Jan-2002 provos

allocate tcp reassembly queue via pool; based on netbsd; okay art@ angelos@


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.20 23-Jun-2001 angelos

branches: 1.20.4;
Hardware checksumming stats.


# 1.19 09-Jun-2001 angelos

Inclusion protection.


# 1.18 28-May-2001 angelos

IP_ENCAPSULATED is deprecated.


# 1.17 20-May-2001 fgsch

Remove varargs from ipv4_input; cmetz@ deraadt@ ok.


# 1.16 01-May-2001 provos

get rid of dtom(), okay itojun@ angelos@ mickey@ millert@


Revision tags: OPENBSD_2_9_BASE
# 1.15 03-Mar-2001 itojun

drop packets with 127.0.0.0/8 in header field, if the packet is from outside.
under RFC1122 sender rule 127.0.0.0/8 must not appear on the wire.
count incidents by ipstat.ips_badaddr. sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.14 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.13 02-Jan-2000 angelos

branches: 1.13.2;
Remove the ifdef for IP_ENCAPSULATED.


Revision tags: kame_19991208
# 1.12 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_5_BASE OPENBSD_2_6_BASE
# 1.11 17-Feb-1999 deraadt

add fragment flood protection; configurable using sysctl ip.maxqueue


# 1.10 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.9 26-Dec-1998 provos

make ip_id random but ensure that ids don't repeat for some period.


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.8 14-Feb-1998 mickey

wildcard ifaces; finally, after HE said it's ok


# 1.7 01-Feb-1998 deraadt

undo wildcard loopback stuff; it was not checked by other developers


# 1.6 01-Feb-1998 mickey

support wildcard loopbacks. that is, setting up lo1 like:
ifconfig lo1 inet 192.168.1.1 netmask 255.255.255.0 link1
would force it to act like all the addresses from net 192.168.1 were
added to the interface.
todo: man lo


Revision tags: OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.5 20-Feb-1997 deraadt

IPSEC package by John Ioannidis and Angelos D. Keromytis. Written in
Greece. From ftp.funet.fi:/pub/unix/security/net/ip/BSDipsec.tar.gz


# 1.4 26-Jan-1997 tholo

Make ip_len and ip_off unsigned values; don't transmit or accept packets
larger than the maximum IP packet size. From NetBSD.


Revision tags: OPENBSD_2_0_BASE
# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.85 15-Nov-2017 mpi

Unbreak ENCDEBUG kernels by declaring `encdebug' in ip_ipsp.h


# 1.84 05-Nov-2017 florian

Finish off pr_drain functions, they haven't been used since 2006.
OK mpi


# 1.83 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


Revision tags: OPENBSD_6_2_BASE
# 1.82 05-Sep-2017 visa

Serialize access to IP reassembly queue with a mutex. This lets
ip_local(), ip_slowtimo() and ip_drain() run without KERNEL_LOCK()
and NET_LOCK().

Input and OK mpi@, bluhm@


# 1.81 01-Sep-2017 mpi

Change sosetopt() to no longer free the mbuf it receives and change
all the callers to call m_freem(9).

Support from deraadt@ and tedu@, ok visa@, bluhm@


# 1.80 14-Jul-2017 tedu

kernels don't build without MROUTING because ip_var.h only sometimes
introduces a forward decl for socket. turns out the affected file doesn't
need ip_var.h, so remove it. then move the decl to the bottom to prevent
the problem from recurring.
bug report by Nick Briggs
ok mpi


# 1.79 26-Jun-2017 bluhm

Convert ip_input() to a pr_input style function. Goal is to process
IPsec packets without additional enqueueing.
OK mpi@


# 1.78 31-May-2017 mpi

Move IPv4 & IPv6 incoming/forwarding path, PIPEX ppp processing and
IPv4 & IPv6 dispatch functions outside the KERNEL_LOCK().

We currently rely on the NET_LOCK() serializing access to most global
data structures for that. IP input queues are no longer used in the
forwarding case. They still exist as boundary between the network and
transport layers because TCP/UDP & friends still need the KERNEL_LOCK().

Since we do not want to grab the NET_LOCK() for every packet, the
softnet thread will do it once before processing a batch. That means
the L2 processing path, which is currently running without lock, will
now run with the NET_LOCK().

IPsec isn't ready to run without KERNEL_LOCK(), so the softnet thread
will grab the KERNEL_LOCK() as soon as ``ipsec_in_use'' is set.

Tested by Hrvoje Popovski.

ok visa@, bluhm@, henning@


# 1.77 30-May-2017 mpi

Introduce ipv{4,6}_input(), two wrappers around IP queues.

This will help transitioning to an un-KERNEL_LOCK()ed IP
forwarding path.

Discussed with bluhm@, ok claudio@


# 1.76 28-May-2017 bluhm

Rename ip_local() to ip_deliver() and give it the same parameters
as the pr_input functions. Add an assert that IPv4 delivery ends
in IP proto done to assure that IPv4 protocol functions work like
IPv6.
OK mpi@


# 1.75 22-May-2017 bluhm

Move IPsec forward and local policy check functions to ipsec_input.c
and give them better names.
input and OK mikeb@


# 1.74 22-May-2017 bluhm

Use the IPsec policy check from IPv4 also when doing local delivery
in ip6_local() to our IPv6 stack.
OK mikeb@


# 1.73 12-May-2017 bluhm

IPsec packets were passed through ip_input() a second time after
they have been decrypted. That means that all the IP header fields
were checked twice. Also fragment reassembly was tried twice.
At pf incoming packets in tunnel mode appeared twice on the enc0
interface, once as IP-in-IP and once as the inner packet. In the
outgoing path pf only sees the inner packet. Asymmetry is bad for
stateful filtering.
IPv6 shows that IPsec works without that. After decrypting immediately
continue with local delivery. In tunnel mode the IP-in-IP protocol
functions pass the inner header to ip6_input(). In transport mode
only pf_test() has to be called for the enc0 device.
Introduce ip_local() to avoid needless processing and cleaner pf
behavior in IPv4 IPsec.
OK mikeb@


# 1.72 12-May-2017 bluhm

Use the IPsec policy check from ipv4_input() also when forwarding
in ip6_input(). While there avoid an ugly #ifdef in ipv4_input().
OK mikeb@


# 1.71 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.70 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.69 03-Mar-2017 bluhm

Convert the variable argument list of the pr_output functions to
fixed parameters.
OK mpi@ claudio@ dhill@


# 1.68 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.67 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.66 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.65 19-Dec-2016 rzalamena

Extend the multicast sockets and multicast hash table support to multiple
domains. This is one step towards supporting more than one multicast
socket running in different domains at the same time.

ok mpi@


# 1.64 28-Nov-2016 bluhm

Path MTU discovery and traceroute did not always work with pf af-to.
If an incoming packet is directly put into the output path, sending
the icmp error packet is never done. As this is basically forwarding,
calling ip_forward() for such packets does everything that is needed.
OK mikeb@


# 1.63 14-Nov-2016 dlg

turn ipstat into a set of percpu counters.

each counter is identified by an enum value which corresponds to the
original members of the ipstat struct.

ipstat_inc(ips_foo) replaces ipstat.ips_foo++ for the actual updates.
ipstat_inc is a thin wrapper around counters_inc.

counters are still returned to userland via the ipstat struct for now.

ok mpi@ mikeb@


Revision tags: OPENBSD_6_0_BASE
# 1.62 15-Apr-2016 mpi

Kill in_rtaddr() and use rtalloc(9) directly in ip_dooptions().

This brings ip_dooptions() closer to mp-safeness by ensuring that
``ifa'' is dereferenced before calling rtfree(9).

ok mikeb@


Revision tags: OPENBSD_5_9_BASE
# 1.61 03-Dec-2015 sashan

ip_send()/ip6_send() allow PF to send response packet in ipsoftnet task.
this avoids current recursion to pf_test() function. the change also
switches icmp_error()/icmp6_error() to use ip_send()/ip6_send() so
they are safe for PF.

The idea comes from Markus Friedl. bluhm, mikeb and mpi helped me
a lot to get it into shape.

OK bluhm@, mpi@


Revision tags: OPENBSD_5_8_BASE
# 1.60 16-Jul-2015 mpi

Kill IP_ROUTETOETHER.

This pseudo-option is a hack to support return-rst on bridge(4). It
passes Ethernet information via a "struct route" through ip_output().

"struct route" is slowly dying...

ok claudio@, benno@


Revision tags: OPENBSD_5_7_BASE
# 1.59 17-Dec-2014 mpi

Remove the "multicast_" prefix from the fields of a multicast-only struct.

Prodded by claudio@ and mikeb@


# 1.58 17-Dec-2014 mpi

Use an interface index instead of a pointer for multicast options.

Output interface (port) selection for multicast traffic is not done via
route lookups. Instead the output ifp is registered when setsockopt(2)
is called with the IP{V6,}_MULTICAST_IF option. But since there is no
mechanism to invalidate such a pointer stored in a pcb when an interface
is destroyed/removed, it might lead your kernel to fault.

Prevent a fault upon resume reported by frantisek holop, thanks!

ok mikeb@, claudio@


# 1.57 05-Nov-2014 mpi

Kill in_iawithaddr() and use ifa_ifwithaddr() directly.

Note that ifa_ifwithaddr() might return a broadcast address, so if you
don't want one make sure to filter them out.

ok mikeb@


Revision tags: OPENBSD_5_6_BASE
# 1.56 21-Apr-2014 henning

ip_output() using varargs always struck me as bizarre, esp since it's only
ever used to pass on uint32 (for ipsec). stop that madness and just pass
the uint32, 0 in all cases but the two that pass the ipsec flowinfo.
ok deraadt reyk guenther


# 1.55 07-Apr-2014 mpi

Retire kernel support for SO_DONTROUTE, this time without breaking
localhost connections.

The plan is to always use the routing table for addresses and routes
resolutions, so there is no future for an option that wants to bypass
it. This option has never been implemented for IPv6 anyway, so let's
just remove the IPv4 bits that you weren't aware of.

Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@


# 1.54 28-Mar-2014 sthen

revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things
for localhost connections. discussed with deraadt@


# 1.53 27-Mar-2014 mpi

Retire kernel support for SO_DONTROUTE, since the plan is to always
use the routing table there's no future for an option that wants to
bypass it. This option has never been implemented for IPv6 anyway,
so let's just remove the IPv4 bits that you weren't aware of.

Tested by florian@, man pages inputs from jmc@, ok benno@


# 1.52 27-Mar-2014 mpi

Stop dereferencing the ifp pointer present in the packet header all
over the input path since it is going to die. Should be no functional
change.

ok mikeb@, lteo@, benno@


Revision tags: OPENBSD_5_5_BASE
# 1.51 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.50 17-Dec-2013 matthew

Change ip_output()'s non-optional arguments to be standard arguments
instead of variable arguments.

Allows stricter type checking by the compiler at call sites and also
saves a bit of code size on some platforms (e.g., ~200 bytes on
amd64).

ok mikeb


# 1.49 17-Nov-2013 bluhm

Instead of stripping the IP options manually in icmp_reflect(),
just call ip_stripoptions(). Remove an unneeded parameter and
adjust the ip length in ip_stripoptions().
from FreeBSD; OK deraadt@ henning@ lteo@


# 1.48 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL
This is a more substantial change than the others committed minutes ago,
so it is separate. More structs get hidden.
ok various


# 1.47 21-Oct-2013 deraadt

There are gasps of shock! Add a pmtu delay sysctl BUTTON for netinet6,
making the code the same as netinet4 along the way.
ok bluhm phessler


# 1.46 13-Aug-2013 mpi

When net.inet.ip.sourceroute is enabled, store the source route
of incoming IPv4 packets with the SSRR or LSRR header option in
an m_tag rather than in a single static entry.

Use a new m_tag type, PACKET_TAG_SRCROUTE, for this and bump
PACKET_TAG_MAXSIZE accordingly.

Adapted from FreeBSD r135274 with inputs from bluhm@.

ok bluhm@, mikeb@


Revision tags: OPENBSD_5_4_BASE
# 1.45 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.44 16-Jul-2012 markus

add IP_IPSECFLOWINFO option to sendmsg() and recvmsg(), so npppd(4)
can use this to select the IPsec tunnel for sending L2TP packets.
this fixes Windows (always binding to 1701) and Android clients
(negotiating wildcard flows); feedback mpf@ and yasuoka@;
ok henning@ and yasuoka@; ok jmc@ for the manpage


# 1.43 17-Mar-2012 dlg

remove IP_JUMBO, SO_JUMBO, and RTF_JUMBO.

no objection from mcbride@ krw@ markus@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.42 19-Apr-2011 dlg

reintroduce using the RB tree for local address lookups. this is
confusing because both addresses and broadcast addresses are put
into the tree.

there are two types of local address lookup. the first is when the
socket layer wants a local address, the second is in ip_input when
the kernel is figuring out the packet is for it to process or
forward.

ip_input considers local addresses and broadcast addresses as local,
however, the handling of broadcast addresses is different depending
on whether ip_directedbcast is set. if ip_directedbcast is unset
then a packet coming in on any interface to any of the system's
broadcast addresses is considered local, otherwise the broadcast
packet must exist on the interface it was received on.

the code also needs to consider classful broadcast addresses so we
can continue some legacy applications (eg, netbooting old sparcs
that use rarp and bootparam requests to classful broadcast addresses
as per PR6382). this diff maintains that support, but restricts it
to packets that are broadcast on the link layer (eg, ethernet
broadcasted packets), and it only looks up addresses on the local
interface. we now only support classful broadcast addresses on local
interfaces to avoid weird side effects with packets routed to us.

the ip4 socket layer does lookups for local addresses with a wrapper
around the global address tree that rejects matches against broadcast
addresses. we no longer support binding sockets to broadcast
addresses, no matter what the value of ip_directedbcast is.

ok henning@
testing (and possibly ok) claudio@


# 1.41 14-Apr-2011 claudio

Backout the in_iawithaddr() -> ifa_ifwithaddr() change.
There is a massive issue with broadcast addrs because ifa_ifwithaddr()
handles them differently than in_iawithaddr().


# 1.40 04-Apr-2011 henning

make in_iawithaddr a wrapper for ifa_ifwithaddr plus a hack for old ancient
classful broadcast so we can still netboot sparc and the like.
compat hack untested, i will deal with the fallout if there is any later
at the same time stop exporting in_iawithaddr, everything but ip_input
should (and now does) use ifa_ifwithaddr directly
ok dlg sthen and agreement from many


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.39 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.38 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


Revision tags: OPENBSD_4_3_BASE
# 1.37 18-Sep-2007 markus

allow 4095 instead of 20 multicast group memberships per socket (you need
one entry for each multicast group and interface combination). this allows
you to run OSPF with more than 10 interfaces.
adapted from freebsd; ok claudio, henning, mpf

